
      Analysis of error profiles in deep next-generation sequencing data

      research-article

          Abstract

          Background

          Sequencing errors are key confounding factors for detecting low-frequency genetic variants that are important for cancer molecular diagnosis, treatment, and surveillance using deep next-generation sequencing (NGS). However, there is a lack of comprehensive understanding of errors introduced at various steps of a conventional NGS workflow, such as sample handling, library preparation, PCR enrichment, and sequencing. In this study, we use current NGS technology to systematically investigate these questions.

          Results

          By evaluating read-specific error distributions, we discover that the substitution error rate can be computationally suppressed to 10⁻⁵ to 10⁻⁴, which is 10- to 100-fold lower than the rate generally considered achievable (10⁻³) in the current literature. We then quantify substitution errors attributable to sample handling, library preparation, enrichment PCR, and sequencing by using multiple deep sequencing datasets. We find that error rates differ by nucleotide substitution type, ranging from 10⁻⁵ for A>C/T>G, C>A/G>T, and C>G/G>C changes to 10⁻⁴ for A>G/T>C changes. Furthermore, C>T/G>A errors exhibit strong sequence context dependency, sample-specific effects dominate elevated C>A/G>T errors, and target-enrichment PCR leads to an approximately 6-fold increase in the overall error rate. We also find that more than 70% of hotspot variants can be detected at frequencies of 0.1% to 0.01% with current NGS technology by applying in silico error suppression.
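To make the per-type error rates above concrete, the following is a minimal sketch of how a substitution error rate could be tallied per pooled substitution type (for example, A>C together with its strand complement T>G). It is not the authors' pipeline: the function names `canonical_type` and `error_rates`, the input layout, and the counts in the example are all hypothetical, chosen only so that the resulting rates fall in the 10⁻⁵ to 10⁻⁴ range quoted in the abstract.

```python
from collections import Counter

# Strand complements; used to pool each substitution with its
# reverse-strand equivalent (e.g. A>C with T>G), as the abstract does.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def canonical_type(ref, alt):
    """Return a pooled label such as 'A>C/T>G' for the substitution ref>alt."""
    if ref in ("A", "C"):
        first = (ref, alt)
        second = (COMPLEMENT[ref], COMPLEMENT[alt])
    else:
        first = (COMPLEMENT[ref], COMPLEMENT[alt])
        second = (ref, alt)
    return f"{first[0]}>{first[1]}/{second[0]}>{second[1]}"


def error_rates(mismatches, ref_base_depth):
    """Compute the error rate for each pooled substitution type.

    mismatches: {(ref, alt): number of mismatching base calls}
    ref_base_depth: {ref: total base calls at positions with that reference base}
    """
    errors = Counter()
    depth = Counter()
    for (ref, alt), count in mismatches.items():
        errors[canonical_type(ref, alt)] += count
    # The denominator for a pooled type is the depth over both reference bases.
    for ref in "ACGT":
        for alt in "ACGT":
            if alt != ref:
                depth[canonical_type(ref, alt)] += ref_base_depth[ref]
    return {t: errors[t] / depth[t] for t in depth if depth[t] > 0}


# Hypothetical counts, for illustration only (not data from the study).
example_depth = {"A": 1_000_000, "C": 1_000_000, "G": 1_000_000, "T": 1_000_000}
example_mismatches = {("A", "G"): 120, ("T", "C"): 80, ("C", "A"): 10, ("G", "T"): 10}
rates = error_rates(example_mismatches, example_depth)
for sub_type, rate in sorted(rates.items()):
    print(f"{sub_type}\t{rate:.1e}")
```

With these made-up counts, A>G/T>C comes out at about 10⁻⁴ and C>A/G>T at about 10⁻⁵. Pooling complementary substitutions gives six classes, which is why the abstract reports pairs rather than twelve separate changes.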

          Conclusions

          We present the first comprehensive analysis of sequencing error sources in conventional NGS workflows. The error profiles revealed by our study highlight new directions for further improving NGS analysis accuracy both experimentally and computationally, ultimately enhancing the precision of deep sequencing.

          Electronic supplementary material

          The online version of this article (10.1186/s13059-019-1659-6) contains supplementary material, which is available to authorized users.

                Author and article information

                Contributors
                Xiaotu.Ma@stjude.org
                Ying.Shao@stjude.org
                Liqing.Tian@stjude.org
                diane.flasch@stjude.org
                heather.mulder@stjude.org
                Michael.Edmonson@stjude.org
                liuyu@scmc.com.cn
                Xiang.Chen@stjude.org
                Scott.Newman@stjude.org
                Joy.Nakitandwe@stjude.org
                yongjin.li@merck.com
                leebenshang@hotmail.com
                sshfranks@126.com
                Zhaoming.Wang@stjude.org
                Sheila.Shurtleff@stjude.org
                Les.Robison@stjude.org
                slevy@hudsonalpha.org
                John.Easton@stjude.org
                Jinghui.Zhang@stjude.org
                Journal
                Genome Biology (Genome Biol)
                BioMed Central (London)
                ISSN: 1474-7596, 1474-760X
                Published: 14 March 2019
                Volume: 20
                Article number: 50
                Affiliations
                [1] Department of Computational Biology, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA (ISNI 0000 0001 0224 711X; GRID grid.240871.8)
                [2] Department of Pathology, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA (ISNI 0000 0001 0224 711X; GRID grid.240871.8)
                [3] Key Laboratory of Pediatric Hematology and Oncology Ministry of Health, Department of Hematology and Oncology, Shanghai Children’s Medical Center, Shanghai Jiao Tong University School of Medicine, Shanghai, 200127, China (ISNI 0000 0004 0368 8293; GRID grid.16821.3c)
                [4] Department of Epidemiology and Cancer Control, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA (ISNI 0000 0001 0224 711X; GRID grid.240871.8)
                [5] HudsonAlpha Institute for Biotechnology, Huntsville, AL 35806, USA (ISNI 0000 0004 0408 3720; GRID grid.417691.c)
                Author information
                http://orcid.org/0000-0002-6233-2145
                Article
                Publisher ID: 1659
                DOI: 10.1186/s13059-019-1659-6
                PMCID: PMC6417284
                PMID: 30867008
                ce253a2d-1408-4bc8-91a3-d87654c5c549
                © The Author(s). 2019

                Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

                History
                : 14 December 2018
                : 19 February 2019
                Categories
                Research

                Genetics
                deep sequencing, error rate, substitution, subclonal, detection, hotspot mutation
