Compression of next-generation sequencing reads aided by highly
  efficient de novo assembly

There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

Abstract

We present Quip, a lossless compression algorithm for next-generation sequencing data in the FASTQ and SAM/BAM formats. In addition to implementing reference-based compression, we have developed, to our knowledge, the first assembly-based compressor, using a novel de novo assembly algorithm. A probabilistic data structure is used to dramatically reduce the memory required by traditional de Bruijn graph assemblers, allowing millions of reads to be assembled very efficiently. Read sequences are then stored as positions within the assembled contigs. This is combined with statistical compression of read identifiers, quality scores, alignment information, and sequences, effectively collapsing very large datasets to less than 15% of their original size with no loss of information. Availability: Quip is freely available under the BSD license from http://cs.washington.edu/homes/dcjones/quip.

Related collections

Most cited references 13

Record: found
Abstract: found
Article: not found

Efficient de novo assembly of large genomes using compressed data structures.

Jared Simpson, Richard Durbin (2012)

De novo genome sequence assembly is important both to generate new sequence assemblies for previously uncharacterized genomes and to identify the genome sequence of individuals in a reference-unbiased way. We present memory efficient data structures and algorithms for assembly using the FM-index derived from the compressed Burrows-Wheeler transform, and a new assembler based on these called SGA (String Graph Assembler). We describe algorithms to error-correct, assemble, and scaffold large sets of sequence data. SGA uses the overlap-based string graph model of assembly, unlike most de novo assemblers that rely on de Bruijn graphs, and is simply parallelizable. We demonstrate the error correction and assembly performance of SGA on 1.2 billion sequence reads from a human genome, which we are able to assemble using 54 GB of memory. The resulting contigs are highly accurate and contiguous, while covering 95% of the reference genome (excluding contigs <200 bp in length). Because of the low memory requirements and parallelization without requiring inter-process communication, SGA provides the first practical assembler to our knowledge for a mammalian-sized genome on a low-end computing cluster.

0 comments Cited 322 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: not found

Efficient storage of high throughput DNA sequencing data using reference-based compression.

Markus Hsi-Yang Fritz, Rasko Leinonen, Guy Cochrane … (2011)

Data storage costs have become an appreciable proportion of total cost in the creation and analysis of DNA sequence data. Of particular concern is that the rate of increase in DNA sequencing is significantly outstripping the rate of increase in disk storage capacity. In this paper we present a new reference-based compression method that efficiently compresses DNA sequences for storage. Our approach works for resequencing experiments that target well-studied genomes. We align new sequences to a reference genome and then encode the differences between the new sequence and the reference genome for storage. Our compression method is most efficient when we allow controlled loss of data in the saving of quality information and unaligned sequences. With this new compression method we observe exponential efficiency gains as read lengths increase, and the magnitude of this efficiency gain can be controlled by changing the amount of quality information stored. Our compression method is tunable: The storage of quality scores and unaligned sequences may be adjusted for different experiments to conserve information or to minimize storage costs, and provides one opportunity to address the threat that increasing DNA sequence volumes will overcome our ability to store the sequences.

0 comments Cited 143 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: found

Is Open Access

Efficient counting of k-mers in DNA sequences using a bloom filter

Páll Melsted, Jonathan Pritchard (2011)

Background Counting k-mers (substrings of length k in DNA sequence data) is an essential component of many methods in bioinformatics, including for genome and transcriptome assembly, for metagenomic sequencing, and for error correction of sequence reads. Although simple in principle, counting k-mers in large modern sequence data sets can easily overwhelm the memory capacity of standard computers. In current data sets, a large fraction-often more than 50%-of the storage capacity may be spent on storing k-mers that contain sequencing errors and which are typically observed only a single time in the data. These singleton k-mers are uninformative for many algorithms without some kind of error correction. Results We present a new method that identifies all the k-mers that occur more than once in a DNA sequence data set. Our method does this using a Bloom filter, a probabilistic data structure that stores all the observed k-mers implicitly in memory with greatly reduced memory requirements. We then make a second sweep through the data to provide exact counts of all nonunique k-mers. For example data sets, we report up to 50% savings in memory usage compared to current software, with modest costs in computational speed. This approach may reduce memory requirements for any algorithm that starts by counting k-mers in sequence data with errors. Conclusions A reference implementation for this methodology, BFCounter, is written in C++ and is GPL licensed. It is available for free download at http://pritch.bsd.uchicago.edu/bfcounter.html

0 comments Cited 77 times – based on 0 reviews      Review now

Bookmark

All references

Author and article information

Journal

Publication date Created: 10 July 2012

Article

ArXiV ID: 1207.2424

SO-VID: 576f826c-214f-4e1b-9a36-139dda14bcb2

License:

http://arxiv.org/licenses/nonexclusive-distrib/1.0/

History

Custom metadata

Categories q-bio.QM cs.DS q-bio.GN

Data availability:

Compression of next-generation sequencing reads aided by highly efficient de novo assembly

Read this article at

Abstract

Related collections

Global Health Next Generation Network

Most cited references 13

Efficient de novo assembly of large genomes using compressed data structures.

Efficient storage of high throughput DNA sequencing data using reference-based compression.

Efficient counting of k-mers in DNA sequences using a bloom filter

Author and article information

Journal

Article

History

Custom metadata

Comments

Comment on this article

Similar content 112

Most referenced authors 139