Query-seeded iterative sequence similarity searching improves selectivity 5–20-fold

Abstract

Iterative similarity search programs, like psiblast, jackhmmer, and psisearch, are much more sensitive than pairwise similarity search methods like blast and ssearch because they build a position-specific scoring model (a PSSM or HMM) that captures the pattern of sequence conservation characteristic of a protein family. But these models are subject to contamination: once an unrelated sequence has been added to the model, homologs of the unrelated sequence will also produce high scores, and the model can diverge from the original protein family. Examination of alignment errors during psiblast PSSM contamination suggested a simple strategy for dramatically reducing PSSM contamination. psiblast PSSMs are built from the query-based multiple sequence alignment (MSA) implied by the pairwise alignments between the query model (PSSM, HMM) and the subject sequences in the library. When the original query sequence residues are inserted into gapped positions in the aligned subject sequence, the resulting PSSM rarely produces alignment over-extensions or alignments to unrelated sequences. This simple step, which tends to anchor the PSSM to the original query sequence and slightly increase target percent identity, can reduce the frequency of false-positive alignments more than 20-fold compared with psiblast and jackhmmer, with little loss in search sensitivity.
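The query-seeding step described in the abstract can be illustrated with a minimal sketch. It assumes a pairwise alignment is given as two equal-length strings with `-` marking gaps; the function name `msa_row`, the `seed_with_query` flag, and the toy sequences are hypothetical and are not taken from psiblast, psisearch, or the authors' software. The sketch only shows how one subject row of a query-based MSA might be derived, with and without query residues filling the subject's gapped positions; it does not build a PSSM.

```python
GAP = "-"


def msa_row(query_aln: str, subject_aln: str, seed_with_query: bool) -> str:
    """Project one aligned subject sequence onto query coordinates.

    query_aln and subject_aln are the two rows of a pairwise alignment
    (hypothetical input format: equal-length strings, '-' for gaps).
    """
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned rows must have equal length")

    row = []
    for q, s in zip(query_aln, subject_aln):
        if q == GAP:
            # Insertion in the subject relative to the query:
            # a query-based MSA has no column for it, so skip it.
            continue
        if s == GAP and seed_with_query:
            # Query seeding: fill the subject's gap with the query residue.
            row.append(q)
        else:
            # Keep the subject residue (or the gap, if seeding is off).
            row.append(s)
    return "".join(row)


if __name__ == "__main__":
    query = "MKT-AYIAK"    # toy aligned query row (assumption)
    subject = "MRTGA--AR"  # toy aligned subject row (assumption)
    print(msa_row(query, subject, seed_with_query=False))  # MRTA--AR
    print(msa_row(query, subject, seed_with_query=True))   # MRTAYIAR
```

In the seeded row, the two gapped positions carry the query's own residues (`Y`, `I`), so those columns pull the resulting model toward the original query, which is the anchoring effect the abstract describes.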

Author and article information

Journal

Journal ID (nlm-ta): Nucleic Acids Res

Journal ID (iso-abbrev): Nucleic Acids Res

Journal ID (publisher-id): nar

Title: Nucleic Acids Research

Publisher: Oxford University Press

ISSN (Print): 0305-1048

ISSN (Electronic): 1362-4962

Publication date (Print): 20 April 2017

Publication date (Electronic): 06 December 2016

Publication date PMC-release: 06 December 2016

Volume: 45

Issue: 7

Page: e46

Affiliations

[1] Dept. of Biochemistry and Molecular Genetics, University of Virginia, School of Medicine, Charlottesville, VA 22908, USA

[2] European Bioinformatics Institute, EMBL Outstation, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

Author notes

[*] To whom correspondence should be addressed. Tel: +1 434 924 2818; Fax: +1 434 924 5069; Email: wrp@virginia.edu

Article

Publisher ID: gkw1207

DOI: 10.1093/nar/gkw1207

PMC ID: 5605230

PubMed ID: 27923999

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

History

Date accepted: 18 November 2016

Date revision received: 14 November 2016

Date received: 23 September 2016

Page count

Pages: 10
