
      Sequencing error profiles of Illumina sequencing instruments

      Research article, NAR Genomics and Bioinformatics, Oxford University Press


          Abstract

          Sequencing technology has achieved great advances in the past decade. Studies have previously shown the quality of specific instruments in controlled conditions. Here, we developed a method able to retroactively determine the error rate of most public sequencing datasets. To do this, we utilized the overlaps between reads that are a feature of many sequencing libraries. With this method, we surveyed 1943 different datasets from seven different sequencing instruments produced by Illumina. We show that among public datasets, the more expensive platforms like HiSeq and NovaSeq have a lower error rate and less variation. But we also discovered that there is great variation within each platform, with the accuracy of a sequencing experiment depending greatly on the experimenter. We show the importance of sequence context, especially the phenomenon where preceding bases bias the following bases toward the same identity. We also show the difference in patterns of sequence bias between instruments. Contrary to expectations based on the underlying chemistry, HiSeq X Ten and NovaSeq 6000 share notable exceptions to the preceding-base bias. Our results demonstrate the importance of the specific circumstances of every sequencing experiment, and the importance of evaluating the quality of each one.
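          The overlap idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a paired-end library in which both mates read into the same fragment, that the fragment length is already known (for example from an alignment), and that errors are rare enough for each disagreement in the overlap to reflect one erroneous base call out of the two compared. The function names and the `fragment_len` parameter are hypothetical.

```python
# Illustrative sketch only; helper names are hypothetical, not from the article.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def overlap_mismatches(read1: str, read2: str, fragment_len: int):
    """Compare the two mates where they cover the same fragment positions.

    read1 is assumed to start at fragment position 0; read2 is assumed to be
    sequenced from the opposite end on the opposite strand.
    Returns (mismatches, compared_positions).
    """
    mate = reverse_complement(read2)
    start = fragment_len - len(mate)      # fragment coordinate of mate[0]
    end = min(len(read1), fragment_len)   # read1 cannot extend past the fragment
    mismatches = 0
    compared = 0
    for pos in range(max(start, 0), end):
        a = read1[pos]
        b = mate[pos - start]
        if "N" in (a, b):                 # skip uncalled bases
            continue
        compared += 1
        mismatches += a != b
    return mismatches, compared


def estimate_error_rate(pairs):
    """Pool mismatches over (read1, read2, fragment_len) tuples.

    Assumption: each mismatch is one error among the two base calls compared,
    so the per-base rate is mismatches / (2 * compared positions).
    Returns None when no overlapping positions were compared.
    """
    total_mm = 0
    total_cmp = 0
    for read1, read2, fragment_len in pairs:
        mm, n = overlap_mismatches(read1, read2, fragment_len)
        total_mm += mm
        total_cmp += n
    if total_cmp == 0:
        return None
    return total_mm / (2 * total_cmp)
```

          For instance, `estimate_error_rate([("ACGTAG", "GTACGT", 10)])` compares the two overlapping positions of a 10-base fragment read with 6-base mates. A real analysis would also need to handle indels, adapter read-through and base qualities, which this sketch ignores.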


                Author and article information

                Journal
                NAR Genomics and Bioinformatics (NAR Genom Bioinform)
                Oxford University Press
                ISSN: 2631-9268
                March 2021
                27 March 2021
                Volume: 3
                Issue: 1
                Article ID: lqab019
                Affiliations
                Graduate Program in Bioinformatics and Genomics, The Huck Institutes for Life Sciences, The Pennsylvania State University , University Park, PA 16802, USA
                Department of Biochemistry and Molecular Biology, The Pennsylvania State University , University Park, PA 16802, USA
                Author notes
                To whom correspondence should be addressed. Tel: +1 814 865 4752; Email: anton@nekrut.org

                The Galaxy Project, https://galaxyproject.org.

                Author information
                https://orcid.org/0000-0002-5987-8032
                Article
                Publisher ID: lqab019
                DOI: 10.1093/nargab/lqab019
                PMCID: PMC8002175
                PMID: 33817639
                © The Author(s) 2021. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

                History
                Received: 05 June 2020
                Revised: 01 February 2021
                Accepted: 16 March 2021
                Page count
                Pages: 9
                Funding
                Funded by: NHGRI, DOI 10.13039/100000051;
                Award ID: U41 HG006620
                Funded by: NSF ABI Grant;
                Award ID: 1661497
                Funded by: NIAID, DOI 10.13039/100000060;
                Award ID: R01 AI134384
                Categories
                AcademicSubjects/SCI00030
                AcademicSubjects/SCI00980
                AcademicSubjects/SCI01060
                AcademicSubjects/SCI01140
                AcademicSubjects/SCI01180
                Methart
