Review of 'FASTA Herder: a web application to trim protein sequence sets'

Reviewer: E. Michael Gertz

Publication date of review: 2014-06-11

Average rating:	    Rated 3 of 5.
Level of importance:	    Rated 4 of 5.
Level of validity:	    Rated 3 of 5.
Level of completeness:	    Rated 3 of 5.
Level of comprehensibility:	    Rated 2 of 5.
Competing interests:	None

Reviewed article

FASTA Herder: a web application to trim protein sequence sets

(2014)

The ever increasing number of sequences in protein databases usually turns out large numbers of homologs in sequence similarity searches. While information from homology can be very useful for functional prediction based on amino acid conservation, many of these homologs usually have high levels of identity among themselves, which hinders multiple sequence alignment (MSA) computation and, especially, visualization. More generally, high redundancy reduces the usability of a protein set in machine learning applications and biases statistical analyses. We developed an algorithm to identify redundant sequence homologs that can be culled producing a streamlined FASTA file. As a difference from other automatic approaches that only aggregate sequences with high identity, our method clusters near-full length homologs allowing for lower sequence identity thresholds. Our method was fully tested and implemented in a web application called FASTA Herder, publicly available at http://www.ogic.ca/projects/fh/orain.html .

Version 1

Review information

Review text

The authors of this article have identified a real and growing
difficulty in identifying and comparing homologous sequences -- the
number of sequences is growing at such an alarming rate that trimming
the set of results is necessary to make appropriate use of the data.
The paper proposes that using full length alignments with stringent
requirements on pairwise alignment coverage, but a weak
percent-identity threshold (the ratio of aligned amino acids that are
identical to amino acid positions that participate in the alignment),
is useful for identifying orthologous sequences.
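
To make the two quantities concrete, here is a minimal sketch of how
percent identity and coverage could be computed from a pairwise
alignment given as two gapped rows. It is an illustration only, not
the authors' implementation: the function names, the reading of
"positions that participate in the alignment" as columns where both
rows hold a residue, and the default thresholds are all assumptions
made for this example.

```python
def alignment_stats(row_a: str, row_b: str) -> tuple[float, float, float]:
    """Return (percent_identity, coverage_a, coverage_b) for two gapped rows.

    Assumption: a position "participates in the alignment" when both rows
    hold a residue (neither is the gap character '-') in that column.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")

    aligned = 0      # columns with a residue in both rows
    identical = 0    # aligned columns whose residues match
    len_a = len_b = 0  # ungapped sequence lengths

    for x, y in zip(row_a, row_b):
        if x != "-":
            len_a += 1
        if y != "-":
            len_b += 1
        if x != "-" and y != "-":
            aligned += 1
            if x.upper() == y.upper():
                identical += 1

    if aligned == 0:
        return 0.0, 0.0, 0.0
    return identical / aligned, aligned / len_a, aligned / len_b


def is_redundant(row_a: str, row_b: str,
                 min_identity: float = 0.3,
                 min_coverage: float = 0.9) -> bool:
    """Hypothetical test: stringent coverage on BOTH sequences, weak identity.

    The default thresholds are placeholders, not values from the paper.
    """
    identity, cov_a, cov_b = alignment_stats(row_a, row_b)
    return identity >= min_identity and min(cov_a, cov_b) >= min_coverage


if __name__ == "__main__":
    print(alignment_stats("MKT-LLVA", "MKTALIVA"))
```

Requiring the coverage threshold on both sequences is what makes the
criterion a near-full-length one: a short fragment that matches part
of a longer protein perfectly would still fail the test.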

By its nature, FASTA Herder is limited to full length sequences of
high quality. Both of these are limitations that would have to be
addressed for de novo discovery, though these issues could properly be
seen as topics for further research. Some protein database sequences
are high-quality but partial, whereas others are full length but of
questionable quality. Retained introns could pose a particular
difficulty for FASTA Herder. It is also not obvious to me, one way or
the other, whether FASTA Herder is or could be useful at lengths
shorter than the full protein, e.g. at classifying sequences at the
domain level.

For a common use case -- the case in which one has a preferred query,
e.g. human, in mind -- presumably one could use the preferred protein
as the representative of its group. The authors did not directly
address this.
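
As a sketch of what this could look like, the function below picks a
representative for one cluster, favouring a user-supplied preferred
sequence when the cluster contains one. Everything here is
hypothetical: the function name, the data layout (a dict from
sequence ID to sequence), and the fallback rule (longest sequence,
ties broken by ID) are assumptions for illustration, not a
description of how FASTA Herder chooses representatives.

```python
def pick_representative(cluster: dict[str, str],
                        preferred_ids: set[str]) -> str:
    """Return the ID of the sequence that should represent `cluster`.

    cluster        maps sequence ID -> amino acid sequence
    preferred_ids  IDs the user wants kept (e.g. the human query protein)
    """
    if not cluster:
        raise ValueError("cluster must contain at least one sequence")

    # If any preferred sequence is in this cluster, keep it.
    preferred = sorted(seq_id for seq_id in cluster if seq_id in preferred_ids)
    if preferred:
        return preferred[0]

    # Assumed fallback: the longest sequence, ties broken alphabetically by ID.
    return min(cluster, key=lambda seq_id: (-len(cluster[seq_id]), seq_id))


if __name__ == "__main__":
    example = {"mouse_P1": "MKTALIVAAA", "human_P1": "MKTALIVA"}
    print(pick_representative(example, {"human_P1"}))
```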

The authors test against OrthoBench, which has "1677 proteins from 12
Metazoa species grouped into 70 protein families". Performing such a
test demonstrates the plausibility of their approach. And the use of
full length alignments is quite interesting. It is fast, conceptually
simple (to align in such a way, the domain structure would have to
line up) and would appear to have low risk of overfitting. It is
intriguing that they do well at low percent identity. The limitation
is that this is a small-scale test, using well-curated sequences. The
test does not directly address the stated problem -- the filtering of
search results from searches of large databases. The authors state
"the number of mistakes could be expected to be smaller than in
real-life applications". So, the results are suggestive and
encouraging, but not comprehensive.

I have rated "level of comprehensibility" a bit low for a simple
reason. The paper has several supplementary tables, but no main
figure or table at all, much less one that summarizes the main results
in a comprehensible way. This was, I think, a poor choice, and if this
situation occurred in a pre-reviewed rather than continuously reviewed
paper, I would ask for a revision that addressed the issue. If the
journal pushed for all figures and tables to be in supplementary, then
the journal has made a poor choice. Either way, the supplementary
tables do not provide a well-presented brief summary of the results.

Version and Review History

Version 2

Version 1

Reviewed by E. Michael Gertz; reviewed by Claudia Chica