LoRDEC: accurate and efficient long read error correction

Abstract

Motivation: PacBio single-molecule real-time sequencing is a third-generation sequencing technique that produces long reads, with comparatively lower throughput and a higher error rate. Errors include numerous indels and complicate downstream analyses such as mapping or de novo assembly. A hybrid strategy that takes advantage of the high accuracy of second-generation short reads has been proposed for correcting long reads. Mapping short reads onto long reads provides sufficient coverage to eliminate up to 99% of errors, but at the expense of prohibitive running times and considerable amounts of disk and memory space.

Results: We present LoRDEC, a hybrid error correction method that builds a succinct de Bruijn graph representing the short reads, and seeks a corrective sequence for each erroneous region in the long reads by traversing chosen paths in the graph. Compared with available tools, LoRDEC is at least six times faster and requires at least 93% less memory or disk space, while achieving comparable accuracy.
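The graph-based idea in the Results paragraph can be illustrated with a toy sketch. It is not LoRDEC's implementation (which is written in C++ and uses a succinct graph representation); it only shows the general shape of the technique under several assumptions made here for illustration: a plain Python set of k-mers stands in for the de Bruijn graph, a long-read k-mer counts as trusted simply if it occurs in the short reads, candidate paths are ranked by edit distance to the region they replace, and reverse complements and read ends are ignored. All function names and the `max_extra` parameter are hypothetical.

```python
def build_kmer_set(short_reads, k):
    """Collect all k-mers of the short reads; the set stands in for the graph's nodes."""
    kmers = set()
    for read in short_reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    return kmers


def successors(kmer, kmers):
    """Yield the nodes reachable from kmer by appending one base."""
    for base in "ACGT":
        nxt = kmer[1:] + base
        if nxt in kmers:
            yield nxt


def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def best_bridge(source, target, region, kmers, max_extra):
    """Depth-first search for the source-to-target path whose spelled
    sequence is closest (in edit distance) to the long-read region."""
    k = len(source)
    max_edges = len(region) - k + max_extra  # bound on path length
    best, best_dist = None, None
    stack = [(source, source)]  # (current node, sequence spelled so far)
    while stack:
        node, spelled = stack.pop()
        if node == target and len(spelled) > k:
            dist = edit_distance(spelled, region)
            if best_dist is None or dist < best_dist:
                best, best_dist = spelled, dist
            continue
        if len(spelled) - k >= max_edges:
            continue
        for nxt in successors(node, kmers):
            stack.append((nxt, spelled + nxt[-1]))
    return best


def correct_read(read, kmers, k, max_extra=3):
    """Replace each untrusted stretch between two trusted k-mers by the
    best graph path; stretches without a path are left unchanged."""
    solid = [i for i in range(len(read) - k + 1) if read[i:i + k] in kmers]
    if not solid:
        return read
    out = read[:solid[0] + k]
    for p, q in zip(solid, solid[1:]):
        if q == p + 1:
            out += read[q + k - 1]  # adjacent trusted k-mers: copy one base
            continue
        bridge = best_bridge(read[p:p + k], read[q:q + k], read[p:q + k],
                             kmers, max_extra)
        out += read[p + k:q + k] if bridge is None else bridge[k:]
    return out + read[solid[-1] + k:]


short_reads = ["ATGCGTACCT", "GTACCTGAAG", "CTGAAGTC"]
kmers = build_kmer_set(short_reads, k=4)
print(correct_read("ATGCGTAGCTGAAGTC", kmers, k=4))   # read with one substitution
print(correct_read("ATGCGTACCTTGAAGTC", kmers, k=4))  # read with one insertion
```

In this toy data set both erroneous reads are restored to the sequence spelled by the short reads, `ATGCGTACCTGAAGTC`. The exhaustive depth-first search is only practical here because the graph is tiny and the path length is bounded; a real tool would need stricter limits on the search.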

Availability and implementation: LoRDEC is written in C++, tested on Linux platforms and freely available at http://atgc.lirmm.fr/lordec.

Contact: lordec@lirmm.fr

Supplementary information: Supplementary data are available at Bioinformatics online.

Author and article information

Journal

Journal ID (nlm-ta): Bioinformatics

Journal ID (iso-abbrev): Bioinformatics

Journal ID (publisher-id): bioinformatics

Journal ID (hwp): bioinfo

Title: Bioinformatics

Publisher: Oxford University Press

ISSN (Print): 1367-4803

ISSN (Electronic): 1367-4811

Publication date (Print): 15 December 2014

Publication date (Electronic): 26 August 2014

Publication date PMC-release: 26 August 2014

Volume: 30

Issue: 24

Pages: 3506-3514

Affiliations

¹Department of Computer Science and Helsinki Institute for Information Technology HIIT, FI-00014 University of Helsinki, Finland and ²LIRMM and Institut de Biologie Computationelle, CNRS and Université Montpellier, 34095 Montpellier Cedex 5, France

Author notes

*To whom correspondence should be addressed.

Associate Editor: Michael Brudno

Article

Publisher ID: btu538

DOI: 10.1093/bioinformatics/btu538

PMC ID: 4253826

PubMed ID: 25165095

SO-VID: 2da77da3-3288-4193-9e8b-55a4b75c7d69

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

History

Date received: 6 April 2014

Date revision received: 28 July 2014

Date accepted: 4 August 2014

Page count

Pages: 9
