      Is Open Access

      Wide-Open: Accelerating public data release by automating detection of overdue datasets

      PLoS Biology
      Public Library of Science


          Abstract

          Open data is a vital pillar of open science and a key enabler for reproducibility, data reuse, and novel discoveries. Enforcement of open-data policies, however, largely relies on manual efforts, which invariably lag behind the increasingly automated generation of biological data. To address this problem, we developed a general approach to automatically identify datasets overdue for public release by applying text mining to identify dataset references in published articles and parse query results from repositories to determine if the datasets remain private. We demonstrate the effectiveness of this approach on 2 popular National Center for Biotechnology Information (NCBI) repositories: Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA). Our Wide-Open system identified a large number of overdue datasets, which spurred administrators to respond directly by releasing 400 datasets in one week.
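The two steps named in the abstract, finding dataset references in article text and checking whether a repository still reports them as private, can be illustrated with a minimal sketch. This is not the Wide-Open implementation: the accession patterns (`GSE` and `SRP` followed by digits), the marker phrase in `PRIVATE_MARKER`, and the caller-supplied `fetch_query_result` function are assumptions made for illustration, and the actual wording of a repository response would need to be checked against the live service.

```python
import re

# Assumed accession patterns: GEO series IDs such as "GSE12345" and
# SRA study IDs such as "SRP012345". Other ID types are ignored in this sketch.
ACCESSION_PATTERNS = {
    "GEO": re.compile(r"\bGSE\d+\b"),
    "SRA": re.compile(r"\bSRP\d+\b"),
}

# Hypothetical marker phrase (an assumption, not a documented repository message).
PRIVATE_MARKER = "is currently private"


def find_accessions(article_text):
    """Return {repository: sorted unique accession IDs} mentioned in the text."""
    return {
        repo: sorted(set(pattern.findall(article_text)))
        for repo, pattern in ACCESSION_PATTERNS.items()
    }


def is_private(query_result_text):
    """Classify a repository query response as private using the assumed marker."""
    return PRIVATE_MARKER in query_result_text.lower()


def overdue_datasets(article_text, fetch_query_result):
    """List accessions cited in a published article that still appear private.

    fetch_query_result is a caller-supplied function (repo, accession) -> str,
    which keeps the sketch independent of any particular repository API.
    """
    overdue = []
    for repo, accessions in find_accessions(article_text).items():
        for accession in accessions:
            if is_private(fetch_query_result(repo, accession)):
                overdue.append((repo, accession))
    return overdue
```

Passing the lookup in as a function keeps the text-mining step testable without network access; a real deployment would replace it with a query against the repository and a parser for its actual response format.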


                Author and article information

                Journal
                PLoS Biology (PLoS Biol)
                Public Library of Science (San Francisco, CA, USA)
                ISSN: 1544-9173, 1545-7885
                Published: 8 June 2017
                Volume 15, Issue 6, e2002477
                Affiliations
                [1] Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, Washington, United States of America
                [2] Microsoft Research, Redmond, Washington, United States of America
                [3] Information School, University of Washington, Seattle, Washington, United States of America
                Author notes

                The authors have declared that no competing interests exist.

                Author information
                https://orcid.org/0000-0003-2265-3881
                Article
                Publisher ID: pbio.2002477
                DOI: 10.1371/journal.pbio.2002477
                PMCID: PMC5464523
                PMID: 28594819
                c596ccb0-3a93-4e98-a588-d3816e0b76a0
                © 2017 Grechkin et al

                This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                Page count
                Figures: 3, Tables: 0, Pages: 5
                Funding
                National Science Foundation BIGDATA (https://www.nsf.gov/), grant number 1247469. Received by BH and MG.
                Alfred P. Sloan Foundation (https://sloan.org/), grant number 3835, through the Data Science Environments program. Received by BH and MG.
                University of Washington Information School (https://ischool.uw.edu/). Received by BH.
                Gordon and Betty Moore Foundation (https://www.moore.org/), grant number 2013-10-29. Received by BH and MG.
                The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
                Categories
                Community Page
                Biology and Life Sciences > Genetics > Gene Expression
                Biology and Life Sciences > Biotechnology
                Computer and Information Sciences > Computer Applications > Web-Based Applications
                Research and Analysis Methods > Database and Informatics Methods > Biological Databases > Sequence Databases
                Research and Analysis Methods > Database and Informatics Methods > Bioinformatics > Sequence Analysis > Sequence Databases
                Computer and Information Sciences > Information Technology > Text Mining
                Research and Analysis Methods > Research Facilities > Information Centers > Archives
                Science Policy > Open Science > Open Data
                Science Policy > Open Science
                Data availability
                All processed data are within the paper and its Supporting Information files. Full texts of processed papers are available through PubMed Central OA (https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/).

                Life sciences
