      The reuse of public datasets in the life sciences: potential risks and rewards



          The ‘big data’ revolution has enabled novel types of analyses in the life sciences, facilitated by public sharing and reuse of datasets. Here, we review the prodigious potential of reusing publicly available datasets and the associated challenges, limitations and risks. Possible solutions to issues and research integrity considerations are also discussed. Due to the prominence, abundance and wide distribution of sequencing data, we focus on the reuse of publicly available sequence datasets. We define ‘successful reuse’ as the use of previously published data to enable novel scientific findings. By using selected examples of successful reuse from different disciplines, we illustrate the enormous potential of the practice, while acknowledging the respective limitations and risks. A checklist to determine the reuse value and potential of a particular dataset is also provided. The open discussion of data reuse and the establishment of this practice as a norm have the potential to benefit all stakeholders in the life sciences.

                Author and article information

                Publisher: PeerJ Inc. (San Diego, USA)
                Published: 22 September 2020
                Volume: 8
                Article number: e9954
                [1] Genetics and Genomics of Plants, Center for Biotechnology (CeBiTec) & Faculty of Biology, Bielefeld University, Bielefeld, Germany
                [2] Graduate School DILS, Bielefeld Institute for Bioinformatics Infrastructure (BIBI), Bielefeld University, Bielefeld, Germany
                [3] Current Affiliation: Intercollege Graduate Degree Program in Plant Biology, Penn State University, University Park, State College, PA, United States of America
                [4] Evolution and Diversity, Department of Plant Sciences, University of Cambridge, Cambridge, United Kingdom
                © 2020 Sielemann et al.

                This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.

                Received: 30 April 2020
                Accepted: 25 August 2020
                Funded by: Deutsche Forschungsgemeinschaft
                Funded by: Bielefeld University
                Funded by: St. Catharine’s College, University of Cambridge
                Support for the Article Processing Charge is provided by the Deutsche Forschungsgemeinschaft and the Open Access Publication Fund of Bielefeld University. K.S. is funded by Bielefeld University. A.H. received the 2018 Richard Hardy Award (St. Catharine’s College, University of Cambridge) which partly supported an internship at Bielefeld University, leading to this collaboration. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
                Computational Biology
                Computational Science
                Data Science

                Keywords: reuse, data science, sequencing data, genomics, bioinformatics, databases, computational biology, open science

