Abstract

Research in bioinformatics primarily involves the collection and analysis of large volumes of genomic data, which in turn demands efficient storage and transfer of that data. In recent years, some research has sought efficient compression algorithms to reduce the size of various kinds of sequencing data. One way to reduce the transmission time of large files is to apply the strongest available lossless compression to them. In this paper, we present SAMZIP, a specialized encoding scheme for sequence alignment data in SAM (Sequence Alignment/Map) format, which improves the compression ratio achieved by existing compression tools. To achieve this, we exploit prior knowledge of the file format and its specification. Our experimental results show that the encoding scheme improves the compression ratio, thereby significantly reducing overall transmission time.
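The general idea of exploiting knowledge of the SAM layout before handing data to a general-purpose compressor can be illustrated with a minimal sketch. This is not the SAMZIP encoding itself, which the abstract does not describe in detail. It only assumes that a SAM record is a tab-separated line with 11 mandatory fields followed by optional tags, and that header lines start with `@`. The function names are hypothetical, and `zlib` stands in for whichever existing compression tool is used.

```python
import zlib

NUM_STREAMS = 12  # 11 mandatory SAM fields + one stream for all optional tags


def split_columns(sam_text):
    """Separate header lines and group alignment fields into per-column lists."""
    headers = []
    columns = [[] for _ in range(NUM_STREAMS)]
    for line in sam_text.splitlines():
        if not line:
            continue
        if line.startswith("@"):  # header lines are kept apart from records
            headers.append(line)
            continue
        # Split into the 11 mandatory fields; any optional tags stay together.
        fields = line.split("\t", NUM_STREAMS - 1)
        fields += [""] * (NUM_STREAMS - len(fields))  # pad records without tags
        for i, value in enumerate(fields):
            columns[i].append(value)
    return headers, columns


def join_columns(headers, columns):
    """Inverse of split_columns: rebuild the original SAM text."""
    records = []
    for row in zip(*columns):
        fields = list(row)
        if fields[-1] == "":  # drop the padding added for tag-less records
            fields = fields[:-1]
        records.append("\t".join(fields))
    return "\n".join(headers + records) + "\n"


def compressed_size_whole(sam_text):
    """Size in bytes after compressing the file as one stream."""
    return len(zlib.compress(sam_text.encode("ascii"), 9))


def compressed_size_by_column(sam_text):
    """Total size in bytes after compressing each column as its own stream."""
    headers, columns = split_columns(sam_text)
    streams = ["\n".join(headers)] + ["\n".join(col) for col in columns]
    return sum(len(zlib.compress(s.encode("ascii"), 9)) for s in streams)
```

Comparing `compressed_size_whole` with `compressed_size_by_column` on a real SAM file shows whether grouping similar values helps for that data. On very small inputs the per-stream overhead can make the column-wise total larger, so the sketch makes no claim about which is smaller in general.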

Most cited references

Compression of DNA sequence reads in FASTQ format.

Sebastian Deorowicz, Szymon Grabowski (2011)

Modern sequencing instruments are able to generate at least hundreds of millions short reads of genomic data. Those huge volumes of data require effective means to store them, provide quick access to any record and enable fast decompression. We present a specialized compression algorithm for genomic data in FASTQ format which dominates its competitor, G-SQZ, as is shown on a number of datasets from the 1000 Genomes Project (www.1000genomes.org). DSRC is freely available at http://sun.aei.polsl.pl/dsrc.

Compressing genomic sequence fragments using SlimGene.

Vineet Bafna, Semyon Kruglyak, Christos Kozanitis … (2011)

With the advent of next generation sequencing technologies, the cost of sequencing whole genomes is poised to go below $1000 per human individual in a few years. As more and more genomes are sequenced, analysis methods are undergoing rapid development, making it tempting to store sequencing data for long periods of time so that the data can be re-analyzed with the latest techniques. The challenging open research problems, huge influx of data, and rapidly improving analysis techniques have created the need to store and transfer very large volumes of data. Compression can be achieved at many levels, including trace level (compressing image data), sequence level (compressing a genomic sequence), and fragment-level (compressing a set of short, redundant fragment reads, along with quality-values on the base-calls). We focus on fragment-level compression, which is the pressing need today. Our article makes two contributions, implemented in a tool, SlimGene. First, we introduce a set of domain specific loss-less compression schemes that achieve over 40× compression of fragments, outperforming bzip2 by over 6×. Including quality values, we show a 5× compression using less running time than bzip2. Second, given the discrepancy between the compression factor obtained with and without quality values, we initiate the study of using "lossy" quality values. Specifically, we show that a lossy quality value quantization results in 14× compression but has minimal impact on downstream applications like SNP calling that use the quality values. Discrepancies between SNP calls made between the lossy and loss-less versions of the data are limited to low coverage areas where even the SNP calls made by the loss-less version are marginal.

A novel compression tool for efficient storage of genome resequencing data

Congmao Wang, Dabing Zhang (2011)

With the advent of DNA sequencing technologies, more and more reference genome sequences are available for many organisms. Analyzing sequence variation and understanding its biological importance are becoming a major research aim. However, how to store and process the huge amount of eukaryotic genome data, such as those of the human, mouse and rice, has become a challenge to biologists. Currently available bioinformatics tools used to compress genome sequence data have some limitations, such as the requirement of the reference single nucleotide polymorphisms (SNPs) map and information on deletions and insertions. Here, we present a novel compression tool for storing and analyzing Genome ReSequencing data, named GRS. GRS is able to process the genome sequence data without the use of the reference SNPs and other sequence variation information and automatically rebuild the individual genome sequence data using the reference genome sequence. When its performance was tested on the first Korean personal genome sequence data set, GRS was able to achieve ∼159-fold compression, reducing the size of the data from 2986.8 to 18.8 MB. While being tested against the sequencing data from rice and Arabidopsis thaliana, GRS compressed the 361.0 MB rice genome data to 4.4 MB, and the A. thaliana genome data from 115.1 MB to 6.5 KB. This de novo compression tool is available at http://gmdd.shgmo.org/Computational-Biology/GRS.

Author and article information

Contributors

: Role: Editor

Journal

Journal ID (nlm-ta): PLoS One

Journal ID (publisher-id): plos

Journal ID (pmc): plosone

Title: PLoS ONE

Publisher: Public Library of Science (San Francisco, USA)

ISSN (Electronic): 1932-6203

Publication date Collection: 2011

Publication date (Electronic): 2 December 2011

Volume: 6

Issue: 12

Electronic Location Identifier: e28251

Affiliations

[1] Department of Computer Science & Engineering, University of South Carolina, Columbia, South Carolina, United States of America

[2] Department of Biochemistry, Medical University of South Carolina, Charleston, South Carolina, United States of America

National Institutes of Health, United States of America

Author notes

* E-mail: huangct@cec.sc.edu

Conceived and designed the experiments: MNS JT C-TH. Performed the experiments: MNS. Analyzed the data: MNS C-TH. Contributed reagents/materials/analysis tools: JT WJZ. Wrote the paper: MNS C-TH. Designed the software used in analysis: MNS. Acquisition of sample data: WJZ.

Article

Publisher ID: PONE-D-11-11665

DOI: 10.1371/journal.pone.0028251

PMC ID: 3229529

PubMed ID: 22164252

SO-VID: b4ab0fd1-9785-43b5-9b8d-c9c6544b54bc

Copyright © Sakib et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

History

Date received: 24 June 2011

Date accepted: 4 November 2011

Page count

Pages: 4

Improving Transmission Efficiency of Large Sequence Alignment/Map (SAM) Files
