      SOAPnuke: a MapReduce acceleration-supported software for integrated quality control and preprocessing of high-throughput sequencing data

      research-article


          Abstract

          Quality control (QC) and preprocessing are essential steps in sequencing data analysis to ensure the accuracy of results. However, existing tools do not offer a satisfactory solution that combines integrated, comprehensive functions, a suitable architecture, and highly scalable acceleration. In this article, we present SOAPnuke, a tool with abundant functions for a “QC-Preprocess-QC” workflow and a MapReduce acceleration framework. Four modules with different preprocessing functions are designed for processing datasets from genomic, small RNA, Digital Gene Expression, and metagenomic experiments, respectively. As a workflow-like tool, SOAPnuke centralizes processing functions into 1 executable and predefines their order, which avoids the need to reformat files when switching between tools. Furthermore, the MapReduce framework provides high scalability by distributing all the processing work across an entire compute cluster.

          We conducted a benchmark in which SOAPnuke and other tools were used to preprocess a ∼30× NA12878 dataset published by GIAB. In standalone operation, SOAPnuke struck a balance between resource occupancy and performance. When accelerated on 16 working nodes with MapReduce, SOAPnuke achieved ∼5.7 times the speed of the fastest of the other tools.
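
          The “QC-Preprocess-QC” workflow and the map/reduce split described in the abstract can be illustrated with a minimal Python sketch. This is not SOAPnuke's implementation: the `Stats` fields, the `keep` filter, its thresholds (`max_n_frac`, `min_mean_q`), and the chunk size are all hypothetical choices made for illustration, and the "cluster" is simulated with the built-in `map` over in-memory chunks. Reads are assumed to be `(name, sequence, quality)` tuples with Phred+33 quality strings.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass
class Stats:
    """Per-dataset QC counters (hypothetical, for illustration only)."""
    reads: int = 0
    bases: int = 0
    q20_bases: int = 0

    def merge(self, other):
        # Reduce step: counters from independent chunks simply add up.
        return Stats(self.reads + other.reads,
                     self.bases + other.bases,
                     self.q20_bases + other.q20_bases)


def phred(qual):
    """Decode a Phred+33 quality string into integer scores."""
    return [ord(c) - 33 for c in qual]


def qc(reads):
    """Collect QC statistics for a list of (name, seq, qual) reads."""
    stats = Stats()
    for _, seq, qual in reads:
        stats.reads += 1
        stats.bases += len(seq)
        stats.q20_bases += sum(q >= 20 for q in phred(qual))
    return stats


def keep(read, max_n_frac=0.1, min_mean_q=20):
    """Assumed filter: drop reads with too many Ns or low mean quality."""
    _, seq, qual = read
    if not seq:
        return False
    scores = phred(qual)
    n_frac = seq.count("N") / len(seq)
    return n_frac <= max_n_frac and sum(scores) / len(scores) >= min_mean_q


def process_chunk(chunk):
    """Map step: QC on raw reads, filter, then QC on the kept reads."""
    kept = [read for read in chunk if keep(read)]
    return qc(chunk), qc(kept), kept


def run(reads, chunk_size=2):
    """Split reads into chunks, map over them, and reduce the statistics."""
    chunks = [reads[i:i + chunk_size] for i in range(0, len(reads), chunk_size)]
    results = list(map(process_chunk, chunks))  # would be distributed on a cluster
    raw = reduce(Stats.merge, (r[0] for r in results), Stats())
    clean = reduce(Stats.merge, (r[1] for r in results), Stats())
    kept = [read for r in results for read in r[2]]
    return raw, clean, kept


reads = [
    ("r1", "ACGTACGT", "IIIIIIII"),  # all bases Q40, no Ns
    ("r2", "ACGTNNNN", "IIIIIIII"),  # half of the bases are N
    ("r3", "ACGTACGT", "########"),  # all bases Q2
]
raw, clean, kept = run(reads)
print(raw)
print(clean)
```

          Because each chunk produces its own before-and-after statistics and the counters are additive, the chunks can be processed independently and merged in any order, which is the property a MapReduce-style framework relies on.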

          Most cited references (24)

          R: A language and environment for statistical computing


            Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences

            Increased reliance on computational approaches in the life sciences has revealed grave concerns about how accessible and reproducible computation-reliant results truly are. Galaxy (http://usegalaxy.org), an open web-based platform for genomic research, addresses these problems. Galaxy automatically tracks and manages data provenance and provides support for capturing the context and intent of computational methods. Galaxy Pages are interactive, web-based documents that provide users with a medium to communicate a complete computational analysis.

              Extensive sequencing of seven human genomes to characterize benchmark reference materials

              The Genome in a Bottle Consortium, hosted by the National Institute of Standards and Technology (NIST), is creating reference materials and data for human genome sequencing, as well as methods for genome comparison and benchmarking. Here, we describe a large, diverse set of sequencing data for seven human genomes; five are current or candidate NIST Reference Materials. The pilot genome, NA12878, has been released as NIST RM 8398. We also describe data from two Personal Genome Project trios, one of Ashkenazim Jewish ancestry and one of Chinese ancestry. The data come from 12 technologies: BioNano Genomics, Complete Genomics paired-end and LFR, Ion Proton exome, Oxford Nanopore, Pacific Biosciences, SOLiD, 10X Genomics GemCode WGS, and Illumina exome and WGS paired-end, mate-pair, and synthetic long reads. Cell lines, DNA, and data from these individuals are publicly available. Therefore, we expect these data to be useful for revealing novel information about the human genome and improving sequencing technologies, SNP, indel, and structural variant calling, and de novo assembly.

                Author and article information

                Journal
                GigaScience
                Oxford University Press
                ISSN: 2047-217X
                January 2018
                04 December 2017
                Volume: 7
                Issue: 1
                Pages: 1-6
                Affiliations
                [1]BGI-Shenzhen, Shenzhen 518083
                [2]Geneplus-Beijing, Beijing 102206
                [3]Department of Oncology, Fujian Medical University Union Hospital, Fuzhou 350001
                [4]Fujian Key Laboratory of Translational Cancer Medicine, Fuzhou 350014
                [5]Department of Stem Cell Research Institute, Fujian Medical University Stem Cell Research Institute, Fuzhou 350000
                [6]Collaborative Innovation Center of High Performance Computing, National University of Defense Technology, Changsha 410073
                [7]Intel China Ltd., Shanghai 200336
                [8]Guangdong Provincial Hospital of Chinese Medicine, Guangzhou 510120
                [9]Department of Surgery, Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong
                [10]James D. Watson Institute of Genome Sciences, Hangzhou 310058, China
                Author notes
                Correspondence address. Lin Fang, BGI-Shenzhen, Shenzhen 518083; Tel: +86-755-36307888; Fax: +86-755-36307273; E-mail: fangl@genomics.cn
                Correspondence address. Qiang Chen, Department of Oncology, Fujian Medical University Union Hospital, Fuzhou 350001; E-mail: cqiang8@189.cn

                Equal contribution.

                Author information
                http://orcid.org/0000-0002-9246-1829
                http://orcid.org/0000-0002-2750-6517
                http://orcid.org/0000-0002-6864-5644
                http://orcid.org/0000-0002-0858-3410
                http://orcid.org/0000-0002-5954-3435
                Article
                Publisher ID: gix120
                DOI: 10.1093/gigascience/gix120
                PMCID: PMC5788068
                PMID: 29220494
                © The Author(s) 2017. Published by Oxford University Press.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

                History
                Received: 17 July 2017
                Revised: 18 October 2017
                Accepted: 22 November 2017
                Page count
                Pages: 6
                Categories
                Technical Note

                Keywords: high-throughput sequencing, quality control, preprocessing, MapReduce
