      Identification and correction of systematic error in high-throughput sequence data


          Abstract

          Background

          A feature common to all DNA sequencing technologies is the presence of base-call errors in the sequenced reads. The implications of such errors are application-specific, ranging from minor informatics nuisances to major problems affecting biological inferences. Recently developed "next-gen" sequencing technologies have greatly reduced the cost of sequencing, but have been shown to be more error-prone than previous technologies. Both position-specific (depending on the location in the read) and sequence-specific (depending on the sequence in the read) errors have been identified in Illumina and Life Technologies sequencing platforms. We describe a new type of systematic error that manifests as statistically unlikely accumulations of errors at specific genome (or transcriptome) locations.
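To make "statistically unlikely accumulation" concrete, the following is a minimal sketch and not the method used by the authors. It assumes, purely for illustration, that base-call errors at a site are independent with a fixed per-base error rate, and asks how probable it is to see at least the observed number of mismatches by chance. The function names, the default error rate of 0.01, and the threshold `alpha` are all hypothetical.

```python
from math import comb


def binomial_upper_tail(n: int, k: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def is_unlikely_accumulation(coverage: int, mismatches: int,
                             error_rate: float = 0.01,
                             alpha: float = 1e-6) -> bool:
    """Flag a site whose mismatch count is improbable under independent errors.

    error_rate and alpha are illustrative assumptions, not values from the paper.
    """
    return binomial_upper_tail(coverage, mismatches, error_rate) < alpha


if __name__ == "__main__":
    # 20 mismatches in 100 reads is far more than a 1% random error rate predicts.
    print(is_unlikely_accumulation(100, 20))
    # A single mismatch in 100 reads is unremarkable.
    print(is_unlikely_accumulation(100, 1))
```

A site flagged this way is only a candidate: a true heterozygous site also produces many mismatches, which is why a separate classifier is needed to tell the two apart.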

          Results

          We characterize and describe systematic errors using overlapping paired reads from high-coverage data. We show that such errors occur in approximately 1 in 1000 base pairs, and that they are highly replicable across experiments. We identify motifs that are frequent at systematic error sites, and describe a classifier that distinguishes heterozygous sites from systematic error. Our classifier is designed to accommodate data from experiments in which the allele frequencies at heterozygous sites are not necessarily 0.5 (such as in the case of RNA-Seq), and can be used with single-end datasets.
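One way overlapping paired reads can expose errors is that both mates cover the same bases of a fragment, so a disagreement inside the overlap means at least one of the two base calls is wrong. The sketch below illustrates that idea under simplifying assumptions that are not taken from the paper: each mate is given as a hypothetical `(start, sequence)` pair already aligned to the reference in reference orientation, with no insertions or deletions.

```python
from collections import Counter


def overlap_disagreements(start1: int, seq1: str,
                          start2: int, seq2: str) -> list[int]:
    """Return reference positions where two overlapping mates disagree.

    Assumes gap-free alignments given in reference orientation.
    """
    lo = max(start1, start2)
    hi = min(start1 + len(seq1), start2 + len(seq2))
    # range(lo, hi) is empty when the mates do not overlap.
    return [pos for pos in range(lo, hi)
            if seq1[pos - start1] != seq2[pos - start2]]


def tally_disagreements(pairs) -> Counter:
    """Count, per reference position, how many read pairs disagree there."""
    counts = Counter()
    for start1, seq1, start2, seq2 in pairs:
        counts.update(overlap_disagreements(start1, seq1, start2, seq2))
    return counts


if __name__ == "__main__":
    pairs = [
        (0, "ACGTAC", 3, "TGC"),  # mates disagree at reference position 4
        (2, "GTGC", 4, "AC"),     # a second pair disagrees at the same position
    ]
    print(tally_disagreements(pairs))
```

Positions where disagreements pile up across many independent read pairs are the kind of candidate that would then be examined for surrounding sequence motifs.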

          Conclusions

          Systematic errors can easily be mistaken for heterozygous sites in individuals, or for SNPs in population analyses. Systematic errors are particularly problematic in low-coverage experiments and in estimates of allele-specific expression from RNA-Seq data. Our characterization of systematic error has allowed us to develop a program, called SysCall, for identifying and correcting such errors. We conclude that correction of systematic errors is important to consider in the design and interpretation of high-throughput sequencing experiments.


                Author and article information

                Journal
                BMC Bioinformatics
                Publisher: BioMed Central
                ISSN: 1471-2105
                Published: 21 November 2011
                Volume: 12
                Article number: 451
                Affiliations
                [1] Department of Mathematics, University of California, Berkeley, 970 Evans Hall #3840, Berkeley, CA 94720 USA
                [2] Children's Hospital Oakland Research Institute, 5700 Martin Luther King Jr Way, Oakland, CA 94609 USA
                [3] Computer Science Division, University of California, Berkeley, 387 Soda Hall, Berkeley, CA 94720 USA
                [4] Department of Molecular & Cell Biology, University of California, Berkeley, 142 LSA #3200, Berkeley, CA 94720 USA
                Article
                Publisher ID: 1471-2105-12-451
                DOI: 10.1186/1471-2105-12-451
                PMCID: PMC3295828
                PMID: 22099972
                ScienceOpen ID: a14c7e3c-1b78-4c36-a3b1-e4c2ebab62d6
                Copyright © 2011 Meacham et al.; licensee BioMed Central Ltd.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

                History
                Received: 25 May 2011
                Accepted: 21 November 2011
                Categories
                Research Article

                Bioinformatics & Computational biology
