Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation

Abstract

Long-read single-molecule sequencing has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. However, given the relatively high error rates of such technologies, efficient and accurate assembly of large repeats and closely related haplotypes remains challenging. We address these issues with Canu, a successor of Celera Assembler that is specifically designed for noisy single-molecule sequences. Canu introduces support for nanopore sequencing, halves depth-of-coverage requirements, and improves assembly continuity while simultaneously reducing runtime by an order of magnitude on large genomes versus Celera Assembler 8.2. These advances result from new overlapping and assembly algorithms, including an adaptive overlapping strategy based on tf-idf weighted MinHash and a sparse assembly graph construction that avoids collapsing diverged repeats and haplotypes. We demonstrate that Canu can reliably assemble complete microbial genomes and near-complete eukaryotic chromosomes using either Pacific Biosciences (PacBio) or Oxford Nanopore technologies and achieves a contig NG50 of >21 Mbp on both human and Drosophila melanogaster PacBio data sets. For assembly structures that cannot be linearly represented, Canu provides graph-based assembly outputs in graphical fragment assembly (GFA) format for analysis or integration with complementary phasing and scaffolding techniques. The combination of such highly resolved assembly graphs with long-range scaffolding information promises the complete and automated assembly of complex genomes.
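The contig NG50 quoted above is a continuity metric: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the genome size. The following is a minimal sketch of that calculation, not code from Canu; the function name `ng50`, the toy contig lengths, and the genome sizes are illustrative assumptions.

```python
from typing import Iterable, Optional


def ng50(contig_lengths: Iterable[int], genome_size: int) -> Optional[int]:
    """Return the NG50 of an assembly.

    NG50 is the largest length L such that contigs of length >= L
    together cover at least half of genome_size. Returns None when
    the whole assembly covers less than half of the genome size.
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    running_total = 0
    # Walk the contigs from longest to shortest, accumulating length.
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        # Integer comparison avoids rounding issues for odd genome sizes.
        if 2 * running_total >= genome_size:
            return length
    return None


# Hypothetical toy assembly (lengths in bp), for illustration only.
toy_contigs = [40, 30, 20, 10]
print(ng50(toy_contigs, genome_size=160))
```

Passing the assembly's own total length as `genome_size` gives the ordinary N50 instead. NG50 is normalized by the genome size rather than by the assembly size, which makes it comparable across assemblies of the same genome.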


Author and article information

Journal

Journal ID (nlm-ta): Genome Res

Journal ID (iso-abbrev): Genome Res

Journal ID (hwp): genome

Journal ID (pmc): genome

Journal ID (publisher-id): GENOME

Title: Genome Research

Publisher: Cold Spring Harbor Laboratory Press

ISSN (Print): 1088-9051

ISSN (Electronic): 1549-5469

Publication date (Print): May 2017

Publication date PMC-release: May 2017

Volume: 27

Issue: 5

Pages: 722-736

Affiliations

[1] Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA;

[2] Invincea Incorporated, Fairfax, Virginia 22030, USA;

[3] J. Craig Venter Institute, Rockville, Maryland 20850, USA;

[4] National Biodefense Analysis and Countermeasures Center, Frederick, Maryland 21702, USA

Author notes

[5] These authors contributed equally to this work.

Corresponding author: adam.phillippy@nih.gov

Author information

Adam M. Phillippy http://orcid.org/0000-0003-2983-8934

Article

Medline ID: 9509184

DOI: 10.1101/gr.215087.116

PMC ID: 5411767

PubMed ID: 28298431

License:

This article, published in Genome Research, is available under a Creative Commons License (Attribution 4.0 International), as described at http://creativecommons.org/licenses/by/4.0/.

History

Date received: 23 August 2016

Date accepted: 3 March 2017

Funding

Funded by: National Human Genome Research Institute http://dx.doi.org/10.13039/100000051

Funded by: National Institutes of Health http://dx.doi.org/10.13039/100000002

Award ID: HSHQDC-07-C-00020

Funded by: US Department of Homeland Security http://dx.doi.org/10.13039/100000180

Funded by: National Science Foundation http://dx.doi.org/10.13039/100000001

Award ID: NSF IOS-1237993
