
      Improving Transmission Efficiency of Large Sequence Alignment/Map (SAM) Files


          Abstract

          Research in bioinformatics primarily involves the collection and analysis of large volumes of genomic data, which in turn demands efficient storage and transfer. In recent years, some research has sought efficient compression algorithms to reduce the size of various kinds of sequencing data. One way to reduce the transmission time of large files is to compress them losslessly as far as possible. In this paper, we present SAMZIP, a specialized encoding scheme for sequence alignment data in SAM (Sequence Alignment/Map) format, which improves the compression ratio achieved by existing compression tools. To achieve this, we exploit prior knowledge of the file format and its specification. Our experimental results show that our encoding scheme improves the compression ratio, thereby significantly reducing overall transmission time.
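The abstract does not describe how SAMZIP encodes its input, so the following is only a minimal sketch of one general way to exploit knowledge of the SAM format: splitting the tab-separated alignment records into per-field streams so that similar values sit next to each other before a general-purpose compressor runs. The function names `encode_columns` and `decode_columns` are hypothetical, zlib stands in for whatever back-end compressor is used, and header lines (starting with `@`) are skipped for brevity. This is not the authors' scheme.

```python
import zlib

# The 11 mandatory SAM fields, in file order.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]


def encode_columns(sam_text: str) -> list[bytes]:
    """Split alignment records into per-field streams and compress each one.

    Assumption: header lines are ignored, so this sketch round-trips
    only the alignment records, not a complete SAM file.
    """
    rows = [line.split("\t") for line in sam_text.splitlines()
            if line and not line.startswith("@")]
    n = len(SAM_FIELDS)
    columns = ["\n".join(row[i] for row in rows) for i in range(n)]
    # Optional TAG:TYPE:VALUE fields go into one extra stream.
    columns.append("\n".join("\t".join(row[n:]) for row in rows))
    return [zlib.compress(col.encode("ascii"), 9) for col in columns]


def decode_columns(streams: list[bytes]) -> str:
    """Invert encode_columns: rebuild the tab-separated records."""
    columns = [zlib.decompress(s).decode("ascii").split("\n") for s in streams]
    lines = []
    for fields in zip(*columns):
        line = "\t".join(fields[:-1])
        if fields[-1]:  # re-attach optional fields when present
            line += "\t" + fields[-1]
        lines.append(line)
    return "\n".join(lines)
```

Whether the split streams actually compress better than the original text depends on the data, so sizes should be measured on real files rather than assumed.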

          Most cited references (3)

          Compression of DNA sequence reads in FASTQ format.

          Modern sequencing instruments are able to generate hundreds of millions of short reads of genomic data or more. Those huge volumes of data require effective means to store them, provide quick access to any record and enable fast decompression. We present a specialized compression algorithm for genomic data in FASTQ format which dominates its competitor, G-SQZ, as is shown on a number of datasets from the 1000 Genomes Project (www.1000genomes.org). DSRC is freely available at http://sun.aei.polsl.pl/dsrc.

            Compressing genomic sequence fragments using SlimGene.

            With the advent of next generation sequencing technologies, the cost of sequencing whole genomes is poised to go below $1000 per human individual in a few years. As more and more genomes are sequenced, analysis methods are undergoing rapid development, making it tempting to store sequencing data for long periods of time so that the data can be re-analyzed with the latest techniques. The challenging open research problems, huge influx of data, and rapidly improving analysis techniques have created the need to store and transfer very large volumes of data. Compression can be achieved at many levels, including trace level (compressing image data), sequence level (compressing a genomic sequence), and fragment-level (compressing a set of short, redundant fragment reads, along with quality-values on the base-calls). We focus on fragment-level compression, which is the pressing need today. Our article makes two contributions, implemented in a tool, SlimGene. First, we introduce a set of domain specific loss-less compression schemes that achieve over 40× compression of fragments, outperforming bzip2 by over 6×. Including quality values, we show a 5× compression using less running time than bzip2. Second, given the discrepancy between the compression factor obtained with and without quality values, we initiate the study of using "lossy" quality values. Specifically, we show that a lossy quality value quantization results in 14× compression but has minimal impact on downstream applications like SNP calling that use the quality values. Discrepancies between SNP calls made between the lossy and loss-less versions of the data are limited to low coverage areas where even the SNP calls made by the loss-less version are marginal.
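The SlimGene abstract does not say how its lossy quality-value quantization is defined. As a rough illustration of the general idea only, the sketch below maps each Phred quality score to the lower bound of a fixed-width bin, which shrinks the alphabet a compressor has to model. The function name `quantize_quality`, the bin width of 8, and the Phred+33 encoding are all assumptions, not details from the article.

```python
def quantize_quality(qual: str, bin_size: int = 8, offset: int = 33) -> str:
    """Coarsen a quality string by rounding each score down to its bin.

    Assumptions: scores are Phred+offset ASCII characters, and uniform
    bins of width bin_size are used; neither comes from the article.
    """
    out = []
    for ch in qual:
        score = ord(ch) - offset                   # decode the Phred score
        binned = (score // bin_size) * bin_size    # lower bound of its bin
        out.append(chr(binned + offset))           # re-encode as ASCII
    return "".join(out)
```

With a bin width of 8, scores from 0 to 47 collapse into six distinct symbols, at the cost of losing the exact original values.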

              A novel compression tool for efficient storage of genome resequencing data

              With the advent of DNA sequencing technologies, more and more reference genome sequences are available for many organisms. Analyzing sequence variation and understanding its biological importance are becoming a major research aim. However, how to store and process the huge amount of eukaryotic genome data, such as those of the human, mouse and rice, has become a challenge to biologists. Currently available bioinformatics tools used to compress genome sequence data have some limitations, such as the requirement of the reference single nucleotide polymorphisms (SNPs) map and information on deletions and insertions. Here, we present a novel compression tool for storing and analyzing Genome ReSequencing data, named GRS. GRS is able to process the genome sequence data without the use of the reference SNPs and other sequence variation information and automatically rebuild the individual genome sequence data using the reference genome sequence. When its performance was tested on the first Korean personal genome sequence data set, GRS was able to achieve ∼159-fold compression, reducing the size of the data from 2986.8 to 18.8 MB. While being tested against the sequencing data from rice and Arabidopsis thaliana, GRS compressed the 361.0 MB rice genome data to 4.4 MB, and the A. thaliana genome data from 115.1 MB to 6.5 KB. This de novo compression tool is available at http://gmdd.shgmo.org/Computational-Biology/GRS.
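The fold-compression figures quoted in the GRS abstract can be checked with simple arithmetic. The short sketch below recomputes them from the reported sizes; the helper name `compression_ratio` is hypothetical, and the conversion of 1 MB to 1024 KB for the A. thaliana case is an assumption, since the abstract does not state which convention it uses.

```python
def compression_ratio(original_mb: float, compressed_mb: float) -> float:
    """Return how many times smaller the compressed data is."""
    return original_mb / compressed_mb


# Sizes in MB as reported in the abstract.
korean = compression_ratio(2986.8, 18.8)   # Korean personal genome
rice = compression_ratio(361.0, 4.4)       # rice genome

# Assumption: 1 MB = 1024 KB, used to convert the 6.5 KB output to MB.
arabidopsis = compression_ratio(115.1, 6.5 / 1024)

for name, ratio in [("Korean genome", korean),
                    ("rice", rice),
                    ("A. thaliana", arabidopsis)]:
    print(f"{name}: {ratio:.1f}-fold")
```

The first ratio rounds to the roughly 159-fold compression stated in the abstract; the other two are not given as ratios there and are derived here only from the quoted sizes.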

                Author and article information

                Contributors
                Role: Editor
                Journal
                PLoS ONE
                Public Library of Science (San Francisco, USA)
                ISSN: 1932-6203
                Published: 2 December 2011
                Volume: 6
                Issue: 12
                Article: e28251
                Affiliations
                [1] Department of Computer Science & Engineering, University of South Carolina, Columbia, South Carolina, United States of America
                [2] Department of Biochemistry, Medical University of South Carolina, Charleston, South Carolina, United States of America
                National Institutes of Health, United States of America
                Author notes

                Conceived and designed the experiments: MNS JT C-TH. Performed the experiments: MNS. Analyzed the data: MNS C-TH. Contributed reagents/materials/analysis tools: JT WJZ. Wrote the paper: MNS C-TH. Designed the software used in analysis: MNS. Acquisition of sample data: WJZ.

                Article
                Manuscript: PONE-D-11-11665
                DOI: 10.1371/journal.pone.0028251
                PMC ID: 3229529
                PMID: 22164252
                b4ab0fd1-9785-43b5-9b8d-c9c6544b54bc
                Sakib et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
                History
                Received: 24 June 2011
                Accepted: 4 November 2011
                Page count
                Pages: 4
                Categories
                Research Article
                Biology
                Computational Biology
                Genomics
                Genome Databases
                Genome Sequencing
                Biological Data Management
                Sequence Analysis
                Genomics
                Genome Databases
                Sequence Databases
                Genome Sequencing
                Computer Science
                Algorithms
                Information Technology
                Databases
