Composition-based statistics and translated nucleotide searches: Improving the TBLASTN module of BLAST

Abstract

Background

TBLASTN is a mode of operation for BLAST that aligns protein sequences to a nucleotide database translated in all six frames. We present the first description of the modern implementation of TBLASTN, focusing on new techniques that were used to implement composition-based statistics for translated nucleotide searches. Composition-based statistics use the composition of the sequences being aligned to generate more accurate E-values, which allows for a more accurate distinction between true and false matches. Until recently, composition-based statistics were available only for protein-protein searches. They are now available as a command line option for recent versions of TBLASTN and as an option for TBLASTN on the NCBI BLAST web server.
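As an illustration of what "translated in all six frames" means, the following minimal sketch translates a DNA string in the three forward and three reverse-complement reading frames using the standard genetic code. It is not TBLASTN's implementation; the function names (`six_frame_translations`, `translate`, `reverse_complement`) and the choice to mark unrecognized codons with `X` are assumptions made for this example.

```python
import itertools

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(itertools.product(BASES, repeat=3), AMINO)
}


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X' (assumption)."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Map frame labels +1..+3 and -1..-3 to their protein translations."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames
```

For example, `six_frame_translations("ATGGCCTAA")[1]` yields `"MA*"`, where `*` marks a stop codon. A protein query can then be compared against each of the six translations.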

Results

We evaluate the statistical and retrieval accuracy of the E-values reported by a baseline version of TBLASTN and by two variants that use different types of composition-based statistics. To test the statistical accuracy of TBLASTN, we ran 1000 searches using scrambled proteins from the mouse genome and a database of human chromosomes. To test retrieval accuracy, we modernize and adapt to translated searches a test set previously used to evaluate the retrieval accuracy of protein-protein searches. We show that composition-based statistics greatly improve the statistical accuracy of TBLASTN, at a small cost to the retrieval accuracy.
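The logic of the scrambled-query test can be sketched as follows. A scrambled protein keeps its amino acid composition but should have no true homologs, so every reported match is a false positive. Because an E-value is the expected number of chance matches per search, the observed number of matches per query with E-value at or below a cutoff should be close to that cutoff if the statistics are accurate. The code below is a minimal sketch under those assumptions; the uniform shuffle and the helper names `scramble_protein` and `false_positives_per_query` are hypothetical and are not taken from the article, which may scramble sequences differently.

```python
import random


def scramble_protein(sequence, seed=None):
    """Shuffle residues uniformly at random, preserving composition."""
    rng = random.Random(seed)
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


def false_positives_per_query(evalues, cutoff, num_queries):
    """Average number of hits per query with E-value <= cutoff.

    With scrambled queries every hit is a false positive, so for
    accurate statistics this value should be close to `cutoff`.
    """
    hits = sum(1 for e in evalues if e <= cutoff)
    return hits / num_queries
```

In use, `evalues` would be the pooled E-values reported across all scrambled-query searches, and `num_queries` the number of searches run (1000 in the test described above).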

Conclusion

TBLASTN is widely used because researchers often need to compare proteins to chromosomes or to libraries of mRNAs. Composition-based statistics improve the statistical accuracy, and therefore the reliability, of TBLASTN results. The algorithms used by TBLASTN are not widely known, and some of the most important are reported here. The data used to test TBLASTN are available for download and may be useful in other studies of translated search algorithms.

Author and article information

Journal

Journal ID (nlm-ta): BMC Biol

Title: BMC Biology

Publisher: BioMed Central (London)

ISSN (Electronic): 1741-7007

Publication date (Collection): 2006

Publication date (Electronic): 7 December 2006

Volume: 4

Page: 41

Affiliations

[1] National Center for Biotechnology Information, National Institutes of Health, Department of Health and Human Services, Bethesda, MD, USA

Article

Publisher ID: 1741-7007-4-41

DOI: 10.1186/1741-7007-4-41

PMC ID: 1779365

PubMed ID: 17156431

SO-VID: c76ff105-0fb2-41b3-80c6-16b3dc1a9175

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

History

Date received: 6 September 2006

Date accepted: 7 December 2006
