Repeat-aware modeling and correction of short read errors

There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

Abstract

Background

High-throughput short read sequencing is revolutionizing genomics and systems biology research by enabling cost-effective deep coverage sequencing of genomes and transcriptomes. Error detection and correction are crucial to many short read sequencing applications including de novo genome sequencing, genome resequencing, and digital gene expression analysis. Short read error detection is typically carried out by counting the observed frequencies of kmers in reads and validating those with frequencies exceeding a threshold. In case of genomes with high repeat content, an erroneous kmer may be frequently observed if it has few nucleotide differences with valid kmers with multiple occurrences in the genome. Error detection and correction were mostly applied to genomes with low repeat content and this remains a challenging problem for genomes with high repeat content.

Results

We develop a statistical model and a computational method for error detection and correction in the presence of genomic repeats. We propose a method to infer genomic frequencies of kmers from their observed frequencies by analyzing the misread relationships among observed kmers. We also propose a method to estimate the threshold useful for validating kmers whose estimated genomic frequency exceeds the threshold. We demonstrate that superior error detection is achieved using these methods. Furthermore, we break away from the common assumption of uniformly distributed errors within a read, and provide a framework to model position-dependent error occurrence frequencies common to many short read platforms. Lastly, we achieve better error correction in genomes with high repeat content. Availability: The software is implemented in C++ and is freely available under GNU GPL3 license and Boost Software V1.0 license at “ http://aluru-sun.ece.iastate.edu/doku.php?id=redeem”.

Conclusions

We introduce a statistical framework to model sequencing errors in next-generation reads, which led to promising results in detecting and correcting errors for genomes with high repeat content.

Related collections

Most cited references 12

Record: found
Abstract: found
Article: found

Is Open Access

Using quality scores and longer reads improves accuracy of Solexa read mapping

Andrew D. M. Smith, Zhenyu Xuan, Michael Zhang (2008)

Background Second-generation sequencing has the potential to revolutionize genomics and impact all areas of biomedical science. New technologies will make re-sequencing widely available for such applications as identifying genome variations or interrogating the oligonucleotide content of a large sample (e.g. ChIP-sequencing). The increase in speed, sensitivity and availability of sequencing technology brings demand for advances in computational technology to perform associated analysis tasks. The Solexa/Illumina 1G sequencer can produce tens of millions of reads, ranging in length from ~25–50 nt, in a single experiment. Accurately mapping the reads back to a reference genome is a critical task in almost all applications. Two sources of information that are often ignored when mapping reads from the Solexa technology are the 3' ends of longer reads, which contain a much higher frequency of sequencing errors, and the base-call quality scores. Results To investigate whether these sources of information can be used to improve accuracy when mapping reads, we developed the RMAP tool, which can map reads having a wide range of lengths and allows base-call quality scores to determine which positions in each read are more important when mapping. We applied RMAP to analyze data re-sequenced from two human BAC regions for varying read lengths, and varying criteria for use of quality scores. RMAP is freely available for downloading at . Conclusion Our results indicate that significant gains in Solexa read mapping performance can be achieved by considering the information in 3' ends of longer reads, and appropriately using the base-call quality scores. The RMAP tool we have developed will enable researchers to effectively exploit this information in targeted re-sequencing projects.

0 comments Cited 85 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: not found

RazerS--fast read mapping with sensitivity control.

Tobias Rausch, A Döring, Anne-Katrin Emde … (2009)

Second-generation sequencing technologies deliver DNA sequence data at unprecedented high throughput. Common to most biological applications is a mapping of the reads to an almost identical or highly similar reference genome. Due to the large amounts of data, efficient algorithms and implementations are crucial for this task. We present an efficient read mapping tool called RazerS. It allows the user to align sequencing reads of arbitrary length using either the Hamming distance or the edit distance. Our tool can work either lossless or with a user-defined loss rate at higher speeds. Given the loss rate, we present an approach that guarantees not to lose more reads than specified. This enables the user to adapt to the problem at hand and provides a seamless tradeoff between sensitivity and running time.

0 comments Cited 58 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: not found

Reptile: representative tiling for short read error correction.

Srinivas Aluru, Karin Dorman, Xiao-Yang Ge (2010)

Error correction is critical to the success of next-generation sequencing applications, such as resequencing and de novo genome sequencing. It is especially important for high-throughput short-read sequencing, where reads are much shorter and more abundant, and errors more frequent than in traditional Sanger sequencing. Processing massive numbers of short reads with existing error correction methods is both compute and memory intensive, yet the results are far from satisfactory when applied to real datasets. We present a novel approach, termed Reptile, for error correction in short-read data from next-generation sequencing. Reptile works with the spectrum of k-mers from the input reads, and corrects errors by simultaneously examining: (i) Hamming distance-based correction possibilities for potentially erroneous k-mers; and (ii) neighboring k-mers from the same read for correct contextual information. By not needing to store input data, Reptile has the favorable property that it can handle data that does not fit in main memory. In addition to sequence data, Reptile can make use of available quality score information. Our experiments show that Reptile outperforms previous methods in the percentage of errors removed from the data and the accuracy in true base assignment. In addition, a significant reduction in run time and memory usage have been achieved compared with previous methods, making it more practical for short-read error correction when sampling larger genomes. Reptile is implemented in C++ and is available through the link: http://aluru-sun.ece.iastate.edu/doku.php?id=software aluru@iastate.edu.

0 comments Cited 52 times – based on 0 reviews      Review now

Bookmark

All references

Author and article information

Conference

Journal ID (nlm-ta): BMC Bioinformatics

Title: BMC Bioinformatics

Publisher: BioMed Central

ISSN (Electronic): 1471-2105

Publication date Collection: 2011

Publication date (Electronic): 15 February 2011

Volume: 12

Issue: Suppl 1

Page: S52

Affiliations

[1 ]Department of Electrical and Computer Engineering, Iowa State University, Ames, Iowa, 50011, USA

[2 ]Department of Computer Science and Engineering, Indian Institute of Technology Bombay, Mumbai, Maharashtra, 400 076, India

[3 ]Department of Statistics and Department of Genetics, Development & Cell Biology, Iowa State University, Ames, Iowa, 50011, USA

Article

Publisher ID: 1471-2105-12-S1-S52

DOI: 10.1186/1471-2105-12-S1-S52

PMC ID: 3044310

PubMed ID: 21342585

SO-VID: 17f9d969-28fd-4ba3-9878-a9bf219803f7

License:

This is an open access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Conference name: The Ninth Asia Pacific Bioinformatics Conference (APBC 2011)

Conference location: Inchon, Korea

Conference date: 11–14 January 2011

History

Comments

Comment on this article

scite_

Cited by 11

See all cited by

Most referenced authors 459

See all reference authors

Repeat-aware modeling and correction of short read errors

Read this article at

Abstract

Background

Results

Conclusions

Related collections

Genetoberfest

Most cited references 12

Using quality scores and longer reads improves accuracy of Solexa read mapping

RazerS--fast read mapping with sensitivity control.

Reptile: representative tiling for short read error correction.

Author and article information

Conference

Affiliations

Article

History

Categories

Comments

Comment on this article

Similar content 58

Cited by 11

Most referenced authors 459