FMLRC: Hybrid long read error correction using an FM-index


Abstract

Background

Long read sequencing is changing the landscape of genomic research, especially de novo assembly. Despite the high error rate inherent to long read technologies, increased read lengths dramatically improve the continuity and accuracy of genome assemblies. However, the cost and throughput of these technologies limit their application to complex genomes. One solution is to decrease the cost and time to assemble novel genomes by leveraging “hybrid” assemblies that use long reads for scaffolding and short reads for accuracy.

Results

We describe a novel method leveraging a multi-string Burrows-Wheeler Transform with auxiliary FM-index to correct errors in long read sequences using a set of complementary short reads. We demonstrate that our method efficiently produces significantly more high quality corrected sequence than existing hybrid error-correction methods. We also show that our method produces more contiguous assemblies, in many cases, than existing state-of-the-art hybrid and long-read only de novo assembly methods.
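To make the underlying data structure concrete, the sketch below builds a toy FM-index over a small set of short reads and uses backward search to count how often a k-mer occurs. It is an illustration only, not the FMLRC implementation: the function names (`build_fm_index`, `count_kmer`, `extension_support`), the naive rotation-sort construction, and the example reads are all assumptions introduced here. The abstract does not specify how k-mer counts drive correction, so `extension_support` shows just one plausible way such counts could be queried.

```python
from collections import Counter


def build_fm_index(reads):
    """Build a toy FM-index over a read collection (naive, for illustration)."""
    # Terminate every read with '$' so that no match can span two reads.
    text = "".join(read + "$" for read in reads)
    n = len(text)
    # Sort all rotations of the text. This is quadratic and only suitable for
    # tiny inputs; real tools use dedicated BWT construction algorithms.
    order = sorted(range(n), key=lambda i: text[i:] + text[:i])
    # The BWT is the character preceding each sorted rotation.
    bwt = "".join(text[i - 1] for i in order)

    # first[c]: number of characters in the text that sort before c.
    alphabet = sorted(set(text))
    counts = Counter(text)
    first = {}
    total = 0
    for c in alphabet:
        first[c] = total
        total += counts[c]

    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (n + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return first, occ, n


def count_kmer(index, kmer):
    """Count occurrences of kmer across all indexed reads via backward search."""
    first, occ, n = index
    lo, hi = 0, n  # half-open range of sorted rotations matching the suffix so far
    for ch in reversed(kmer):
        if ch not in first:
            return 0
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


def extension_support(index, prefix, bases="ACGT"):
    """Hypothetical helper: short-read support for each one-base extension."""
    return {b: count_kmer(index, prefix + b) for b in bases}


if __name__ == "__main__":
    short_reads = ["ACGTACGT", "CGTACG", "TTACGA"]  # made-up example reads
    index = build_fm_index(short_reads)
    print(count_kmer(index, "ACG"))
    print(extension_support(index, "ACG"))
```

Each query takes time proportional to the k-mer length and does not depend on how many reads are indexed. That property is what makes an FM-index attractive for repeatedly looking up short-read support while walking along a long read.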

Conclusion

Our method accurately corrects long read sequence data using complementary short reads. We demonstrate higher total throughput of corrected long reads and a corresponding increase in contiguity of the resulting de novo assemblies. Its improved throughput and computational efficiency relative to existing methods will help make more economical use of emerging long read sequencing technologies.


Author and article information

Contributors

Jeremy R. Wang: jeremy_wang@med.unc.edu

James Holt: holtjma@cs.unc.edu

Leonard McMillan: mcmillan@cs.unc.edu

Corbin D. Jones: cdjones@email.unc.edu

Journal

Journal ID (nlm-ta): BMC Bioinformatics

Journal ID (iso-abbrev): BMC Bioinformatics

Title: BMC Bioinformatics

Publisher: BioMed Central (London)

ISSN (Electronic): 1471-2105

Publication date (Electronic): 9 February 2018

Publication date PMC-release: 9 February 2018

Publication date Collection: 2018

Volume: 19

Electronic Location Identifier: 50

Affiliations

[1] Department of Genetics, University of North Carolina at Chapel Hill, CB 3280, 3144 Genome Sciences Building, 250 Bell Tower Dr, Chapel Hill, NC 27599, USA (ISNI 0000000122483208; GRID grid.10698.36)

[2] Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA (ISNI 0000000122483208; GRID grid.10698.36)

[3] Department of Biology and Integrative Program for Biological and Genome Sciences, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA (ISNI 0000000122483208; GRID grid.10698.36)

Article

Publisher ID: 2051

DOI: 10.1186/s12859-018-2051-3

PMC ID: 5807796

PubMed ID: 29426289

SO-VID: 085d115f-9226-4bed-880e-8b377d58b2a6

License:

Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

History

Date received: 8 February 2017

Date accepted: 1 February 2018

Funding

Funded by: National Science Foundation (FundRef http://dx.doi.org/10.13039/100000001)

Award ID: DEB-1457707

Funded by: North Carolina Biotechnology Center (FundRef http://dx.doi.org/10.13039/100005562)

Award ID: 2013-MRG-1110

Funded by: University Cancer Research Fund

Funded by: National Institute of General Medical Sciences (FundRef http://dx.doi.org/10.13039/100000057)

Award ID: P50 GM076468

Funded by: National Institute of Diabetes and Digestive and Kidney Diseases (FundRef http://dx.doi.org/10.13039/100000062)

Award ID: T32DK007737-17S1

Custom metadata

ScienceOpen disciplines: Bioinformatics & Computational biology

Keywords: de novo assembly, hybrid error correction, long read, PacBio, BWT, FM-index
