DBG2OLC: Efficient Assembly of Large Genomes Using the Compressed Overlap Graph

Abstract

The highly anticipated transition to third-generation DNA sequencing (3rdGS) technology has reached a stalemate, primarily because of high error rates (15-45%). These make the assembly of long erroneous reads extremely challenging, since existing software solutions for 3rdGS assembly are often overwhelmed by error correction tasks. We report three significant breakthroughs that push the envelope of genome assembly and offer an enabling software solution to overcome the current 3rdGS stalemate. Firstly, we take a counter-intuitive strategy and develop a base-level correction-free assembly algorithm that relies on data compression: the assembly is performed on the compressed reads. Orders of magnitude of compression lead to orders-of-magnitude reductions in read length, enabling comparable savings in computational time and memory. We implement the new algorithm in a proof-of-concept software package, DBG2OLC. Experiments with 3rdGS data, including PacBio and Oxford Nanopore, show that our method assembles large genomes orders of magnitude more efficiently than existing methods. For example, on a large PacBio human genome dataset we calculated the all-pair alignment of 54x erroneous long reads in 6 hours, compared to the 405,000 CPU hours previously reported by Pacific Biosciences. Secondly, while maintaining comparably high-quality assemblies, our approach requires significantly lower sequencing coverage (10x-20x) than existing assemblers, which translates to a significant reduction in sequencing cost. Thirdly, our method is highly adaptive and is the only one to date that demonstrates very high efficiency not only for 3rdGS PacBio and Nanopore sequences, but also for the latest NGS data.
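The abstract does not spell out the compression scheme, so the following is only a minimal sketch of the general idea of assembling from compressed reads. It assumes that each long read is replaced by the ordered list of identifiers of short "anchor" sequences it contains, and that candidate overlaps are found by comparing those short lists instead of the bases. The names `build_anchor_index`, `compress_read`, and `candidate_overlaps`, the toy anchors, and the `min_shared` threshold are all hypothetical and are not taken from DBG2OLC.

```python
from collections import defaultdict
from itertools import combinations


def build_anchor_index(anchors, k):
    """Map every k-mer of every anchor sequence to that anchor's ID (assumed scheme)."""
    index = {}
    for anchor_id, seq in anchors.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]] = anchor_id
    return index


def compress_read(read, index, k):
    """Replace a read by the ordered list of anchor IDs hit by its k-mers.

    Consecutive hits to the same anchor are collapsed, so a read of
    thousands of bases shrinks to a handful of integers.
    """
    ids = []
    for i in range(len(read) - k + 1):
        anchor_id = index.get(read[i:i + k])
        if anchor_id is not None and (not ids or ids[-1] != anchor_id):
            ids.append(anchor_id)
    return ids


def candidate_overlaps(compressed, min_shared=2):
    """Return read pairs that share at least `min_shared` distinct anchor IDs."""
    reads_by_anchor = defaultdict(set)
    for name, ids in compressed.items():
        for anchor_id in set(ids):
            reads_by_anchor[anchor_id].add(name)
    shared = defaultdict(int)
    for names in reads_by_anchor.values():
        for a, b in combinations(sorted(names), 2):
            shared[(a, b)] += 1
    return {pair: n for pair, n in shared.items() if n >= min_shared}


# Toy data: three anchors and three reads that span them (illustrative only).
K = 4
anchors = {0: "ACGTAC", 1: "GGCTTA", 2: "TTGACC"}
reads = {
    "r1": "ACGTACAGGCTTA",
    "r2": "GGCTTACTTGACC",
    "r3": "ACGTACTGGCTTAGTTGACC",
}
index = build_anchor_index(anchors, K)
compressed = {name: compress_read(seq, index, K) for name, seq in reads.items()}
overlaps = candidate_overlaps(compressed)
```

In this toy setting each read collapses to two or three integers, and overlap candidates are found without any base-level alignment or error correction. A real assembler would additionally have to check that shared anchors occur in a consistent order and orientation.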


Author and article information

Publication date Created: 2014-10-10

Publication date Updated: 2015-05-22

Article

arXiv ID: 1410.2801

SO-VID: 719ba946-3ea4-4ffc-a7c3-4beaae66fbae

License:

http://arxiv.org/licenses/nonexclusive-distrib/1.0/

Custom metadata

Categories q-bio.GN

ScienceOpen disciplines: Genetics

Cited by 14
