Burrows-Wheeler transform for terabases

Abstract

In order to avoid the reference bias introduced by mapping reads to a reference genome, bioinformaticians are investigating reference-free methods for analyzing sequenced genomes. With large projects sequencing thousands of individuals, this raises the need for tools capable of handling terabases of sequence data. A key method is the Burrows-Wheeler transform (BWT), which is widely used for compressing and indexing reads. We propose a practical algorithm for building the BWT of a large read collection by merging the BWTs of subcollections. With our 2.4 Tbp datasets, the algorithm can merge 600 Gbp/day on a single system, using 30 gigabytes of memory overhead on top of the run-length encoded BWTs.
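
As a rough illustration of what merging the BWTs of two subcollections involves, the toy sketch below uses one well-known approach: for every suffix of the second collection, count how many suffixes of the first collection are smaller, then interleave the two BWTs accordingly. It is not the implementation described in the abstract, which works on run-length encoded BWTs at terabase scale. The function names (`naive_bwt`, `lf`, `merge_bwts`) are hypothetical, and the sketch assumes every read ends with an implicit `$` marker, with ties between markers broken by read order.

```python
def naive_bwt(reads):
    """BWT of a read collection by sorting all suffixes (tiny inputs only)."""
    suffixes = []
    for idx, read in enumerate(reads):
        for pos in range(len(read) + 1):
            # The implicit end marker '$' precedes the first character.
            prev = read[pos - 1] if pos > 0 else "$"
            # Sort key: suffix text, then read index to order the end markers.
            suffixes.append((read[pos:], idx, prev))
    suffixes.sort()
    return "".join(prev for _, _, prev in suffixes)


def lf(bwt, ch, i):
    """Count suffixes smaller than ch + X, given that i suffixes are smaller than X."""
    smaller = sum(1 for other in bwt if other < ch)
    return smaller + bwt[:i].count(ch)


def merge_bwts(bwt_a, bwt_b):
    """Merge two collection BWTs; reads of A are assumed to precede reads of B."""
    n_a_reads = bwt_a.count("$")
    n_b_reads = bwt_b.count("$")
    # ranks[j] = number of suffixes of A smaller than the j-th suffix of B.
    ranks = [0] * len(bwt_b)
    for k in range(n_b_reads):
        # Start from the end marker of the k-th read of B and walk backwards.
        pos_b, rank_a = k, n_a_reads
        while True:
            ranks[pos_b] = rank_a
            ch = bwt_b[pos_b]
            if ch == "$":
                break  # reached the start of the read
            rank_a = lf(bwt_a, ch, rank_a)
            pos_b = lf(bwt_b, ch, pos_b)
    # Interleave: copy characters of A up to each rank, then one character of B.
    merged, next_a = [], 0
    for j, ch in enumerate(bwt_b):
        merged.extend(bwt_a[next_a:ranks[j]])
        next_a = ranks[j]
        merged.append(ch)
    merged.extend(bwt_a[next_a:])
    return "".join(merged)


if __name__ == "__main__":
    a, b = ["AC"], ["CA"]
    print(merge_bwts(naive_bwt(a), naive_bwt(b)))  # CAC$A$
    print(naive_bwt(a + b))                        # CAC$A$
```

The merge only reads the two BWTs, never the original reads, which is what makes this style of merging attractive when the subcollection BWTs are already built. The linear-time `lf` here is purely for readability; a practical implementation would need a rank structure over a compressed BWT.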

Author and article information

Journal

Publication date Created: 2015-11-03

Publication date Updated: 2016-01-14

Article

arXiv ID: 1511.00898

SO-VID: 061d4add-c185-4d0e-ab30-d7e78acc83e0

License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/

Custom metadata

Comments This is the full version of the paper that was accepted to DCC 2016. The implementation is available at https://github.com/jltsiren/bwt-merge

Categories cs.DS

ScienceOpen disciplines: Data structures & Algorithms
