Benchmarking the next generation of homology inference tools

Abstract

Motivation: Over the last decades, vast numbers of sequences were deposited in public databases. Bioinformatics tools allow homology and consequently functional inference for these sequences. New profile-based homology search tools have been introduced, allowing reliable detection of remote homologs, but have not been systematically benchmarked. To provide such a comparison, which can guide bioinformatics workflows, we extend and apply our previously developed benchmark approach to evaluate the ‘next generation’ of profile-based approaches, including CS-BLAST, HHSEARCH and PHMMER, in comparison with the non-profile-based search tools NCBI-BLAST, USEARCH, UBLAST and FASTA.

Method: We generated challenging benchmark datasets based on protein domain architectures within either the PFAM + Clan, SCOP/Superfamily or CATH/Gene3D domain definition schemes. From each dataset, homologous and non-homologous protein pairs were aligned using each tool, and standard performance metrics were calculated. We further measured the congruence of domain architecture assignments in the three domain databases.
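As an illustration of the evaluation step, the sketch below scores one tool's output on labelled protein pairs. It is not the authors' code: the abstract does not say which metrics or thresholds were used, so the choice of sensitivity, specificity and accuracy, the E-value cutoff of 1e-3, and the names `PairResult`, `confusion_counts` and `metrics` are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class PairResult:
    """One benchmark pair (hypothetical record layout)."""
    is_homolog: bool  # benchmark label: True if the pair is homologous
    evalue: float     # E-value the search tool reported for the pair


def confusion_counts(results, evalue_cutoff):
    """Count TP, FP, TN, FN when pairs at or below the cutoff are called homologous."""
    tp = fp = tn = fn = 0
    for r in results:
        predicted = r.evalue <= evalue_cutoff
        if predicted and r.is_homolog:
            tp += 1
        elif predicted:
            fp += 1
        elif r.is_homolog:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def metrics(results, evalue_cutoff=1e-3):
    """Return sensitivity, specificity and accuracy (assumed metric set)."""
    tp, fp, tn, fn = confusion_counts(results, evalue_cutoff)
    total = tp + fp + tn + fn
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
    }


# Toy data, invented for illustration only.
toy = [
    PairResult(True, 1e-10),   # homolog found
    PairResult(True, 0.5),     # homolog missed
    PairResult(False, 1e-5),   # non-homolog wrongly reported
    PairResult(False, 10.0),   # non-homolog correctly rejected
]
toy_metrics = metrics(toy)
```

Sweeping `evalue_cutoff` over a range of values would give the points of a ROC-style curve for each tool.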

Results: CS-BLAST and PHMMER had the highest overall accuracy. FASTA, UBLAST and USEARCH traded substantial accuracy for speed.

Conclusion: Profile methods are superior at inferring remote homologs, but the difference in accuracy between methods is relatively small. PHMMER and CS-BLAST stand out with the highest accuracy, yet still at a reasonable computational cost. Additionally, we show that less than 0.1% of Swiss-Prot protein pairs considered homologous by one database are considered non-homologous by another, implying that these classifications represent equivalent underlying biological phenomena, differing mostly in coverage and granularity.
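The congruence figure can be read as a conflict rate between databases. The following sketch shows one way such a rate might be computed. The abstract does not give the exact definition, so the denominator chosen here (pairs called homologous by at least one database), the use of `None` for pairs a database does not cover, and the function name `conflict_fraction` are assumptions.

```python
def conflict_fraction(calls):
    """Fraction of pairs called homologous by some database that another
    database calls non-homologous.

    calls maps a protein pair to {database_name: True | False | None},
    where None means the database does not cover that pair.
    """
    homologous_somewhere = 0
    conflicting = 0
    for per_db in calls.values():
        # Ignore databases that make no call for this pair.
        verdicts = [v for v in per_db.values() if v is not None]
        if True in verdicts:
            homologous_somewhere += 1
            if False in verdicts:
                conflicting += 1
    if homologous_somewhere == 0:
        return 0.0
    return conflicting / homologous_somewhere


# Toy calls, invented for illustration only.
toy_calls = {
    ("A", "B"): {"pfam": True, "scop": True, "cath": None},    # agreement
    ("A", "C"): {"pfam": True, "scop": False, "cath": None},   # conflict
    ("B", "C"): {"pfam": False, "scop": False, "cath": False}, # never homologous
    ("C", "D"): {"pfam": None, "scop": True, "cath": True},    # agreement
}
toy_fraction = conflict_fraction(toy_calls)
```

Treating uncovered pairs as missing rather than as non-homologous keeps differences in coverage from being counted as disagreement.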

Availability and Implementation: Benchmark datasets and all scripts are available at http://sonnhammer.org/download/Homology_benchmark.

Contact: forslund@embl.de

Supplementary information: Supplementary data are available at Bioinformatics online.

Author and article information

Journal

Journal ID (nlm-ta): Bioinformatics

Journal ID (iso-abbrev): Bioinformatics

Journal ID (publisher-id): bioinformatics

Journal ID (hwp): bioinfo

Title: Bioinformatics

Publisher: Oxford University Press

ISSN (Print): 1367-4803

ISSN (Electronic): 1367-4811

Publication date (Print): 01 September 2016

Publication date (Electronic): 01 June 2016

Publication date PMC-release: 01 June 2016

Volume: 32

Issue: 17

Pages: 2636-2641

Affiliations

¹Science for Life Laboratory, Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, Stockholm SE-10691, Sweden

²European Molecular Biology Laboratory, Structural and Computational Biology Unit, Heidelberg 69117, Germany

Author notes

*To whom correspondence should be addressed.

Associate Editor: Burkhard Rost

Article

Publisher ID: btw305

DOI: 10.1093/bioinformatics/btw305

PMC ID: 5013910

PubMed ID: 27256311

SO-VID: 75a8fd39-a606-49f3-a13a-0db87fb0775d

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

History

Date received: 02 July 2015

Date revision received: 11 April 2016

Date accepted: 05 May 2016

Page count

Pages: 6
