Fast alignment-free sequence comparison using spaced-word frequencies

Abstract

Motivation: Alignment-free methods for sequence comparison are increasingly used for genome analysis and phylogeny reconstruction; they circumvent various difficulties of traditional alignment-based approaches. In particular, alignment-free methods are much faster than pairwise or multiple alignments. They are, however, less accurate than methods based on sequence alignment. Most alignment-free approaches work by comparing the word composition of sequences. A well-known problem with these methods is that neighbouring word matches are far from independent.

Results: To reduce the statistical dependency between adjacent word matches, we propose to use ‘spaced words’, defined by patterns of ‘match’ and ‘don’t care’ positions, for alignment-free sequence comparison. We describe a fast implementation of this approach using recursive hashing and bit operations, and we show that further improvements can be achieved by using multiple patterns instead of single patterns. To evaluate our approach, we use spaced-word frequencies as a basis for fast phylogeny reconstruction. Using real-world and simulated sequence data, we demonstrate that our multiple-pattern approach produces better phylogenies than approaches relying on contiguous words.
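
The idea of spaced words can be illustrated with a minimal sketch. It is not the authors' implementation: it uses plain string slicing rather than the recursive hashing and bit operations mentioned above. The function names, the '1'/'0' pattern encoding (match / don't care), the Euclidean distance and the averaging over patterns are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def spaced_word_counts(seq, pattern):
    """Count spaced words in seq for one pattern.

    pattern is a string of '1' (match position, character kept) and
    '0' (don't-care position, character skipped). This encoding is an
    assumption of the sketch.
    """
    keep = [i for i, c in enumerate(pattern) if c == "1"]
    span = len(pattern)
    return Counter(
        "".join(seq[start + i] for i in keep)
        for start in range(len(seq) - span + 1)
    )


def relative_frequencies(counts):
    """Turn raw counts into relative frequencies (empty dict if no words)."""
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()} if total else {}


def euclidean(f, g):
    """Euclidean distance between two sparse frequency vectors."""
    return sqrt(sum((f.get(w, 0.0) - g.get(w, 0.0)) ** 2 for w in set(f) | set(g)))


def spaced_word_distance(seq_a, seq_b, patterns):
    """Average the per-pattern distances over a set of patterns.

    Averaging is one plausible way to combine multiple patterns; it is
    an assumption here, not a statement about the published method.
    """
    dists = []
    for p in patterns:
        fa = relative_frequencies(spaced_word_counts(seq_a, p))
        fb = relative_frequencies(spaced_word_counts(seq_b, p))
        dists.append(euclidean(fa, fb))
    return sum(dists) / len(dists)
```

For example, `spaced_word_counts("ACGTACGT", "101")` reads the first and third character of every window of length 3 and counts the resulting two-letter words. A pairwise distance matrix built with `spaced_word_distance` could then be passed to a distance-based tree-building method.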

Availability and implementation: Our program is freely available at http://spaced.gobics.de/.

Contact: chris.leimeister@stud.uni-goettingen.de

Supplementary information: Supplementary data are available at Bioinformatics online.

Author and article information

Journal

Journal ID (nlm-ta): Bioinformatics

Journal ID (iso-abbrev): Bioinformatics

Journal ID (publisher-id): bioinformatics

Journal ID (hwp): bioinfo

Title: Bioinformatics

Publisher: Oxford University Press

ISSN (Print): 1367-4803

ISSN (Electronic): 1367-4811

Publication date (Print): 15 July 2014

Publication date (Electronic): 03 April 2014

Publication date PMC-release: 03 April 2014

Volume: 30

Issue: 14

Pages: 1991-1999

Affiliations

¹Department of Bioinformatics, University of Göttingen, Institute of Microbiology and Genetics, 37073 Göttingen, Germany and ²Université d’Évry Val d’Essonne, Laboratoire Statistique et Génome, UMR CNRS 8071, USC INRA, 91037 Évry, France

Author notes

*To whom correspondence should be addressed.

Associate Editor: John Hancock

Article

Publisher ID: btu177

DOI: 10.1093/bioinformatics/btu177

PMC ID: 4080745

PubMed ID: 24700317

SO-VID: adf65875-bf72-4926-8438-bcb6d2968879

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

History

Date received: 25 November 2013

Date revision received: 18 March 2014

Date accepted: 30 March 2014

Page count

Pages: 9
