Review of 'FASTA Herder: a web application to trim protein sequence sets'

3
Average rating:
    Rated 3 of 5.
Level of importance:
    Rated 4 of 5.
Level of validity:
    Rated 3 of 5.
Level of completeness:
    Rated 3 of 5.
Level of comprehensibility:
    Rated 2 of 5.
Competing interests:
None

Reviewed article

FASTA Herder: a web application to trim protein sequence sets

(2014)
The ever increasing number of sequences in protein databases usually turns out large numbers of homologs in sequence similarity searches. While information from homology can be very useful for functional prediction based on amino acid conservation, many of these homologs usually have high levels of identity among themselves, which hinders multiple sequence alignment (MSA) computation and, especially, visualization. More generally, high redundancy reduces the usability of a protein set in machine learning applications and biases statistical analyses. We developed an algorithm to identify redundant sequence homologs that can be culled producing a streamlined FASTA file. As a difference from other automatic approaches that only aggregate sequences with high identity, our method clusters near-full length homologs allowing for lower sequence identity thresholds. Our method was fully tested and implemented in a web application called FASTA Herder, publicly available at http://www.ogic.ca/projects/fh/orain.html .

    Review information

    10.14293/S2199-1006.1.SOR-LIFE.A67837.v1.RPDLCU

    This work has been published open access under Creative Commons Attribution License CC BY 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Conditions, terms of use and publishing policy can be found at www.scienceopen.com.

    Review text

    The authors of this article have identified a real and growing
    difficulty in identifying and comparing homologous sequences -- the
    number of sequences is growing at such an alarming rate that trimming
    the set of results is necessary to make appropriate use of the data.
    The paper proposes that using full length alignments with stringent
    requirements on pairwise alignment coverage, but a weak
    percent-identity threshold (the ratio of aligned amino acids that are
    identical to amino acid positions that participate in the alignment)
    is useful for identifying orthologous sequences.
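
    The two quantities in play here, percent identity and pairwise
    alignment coverage, can be sketched as follows. This is a minimal
    illustration of the definitions given above, not FASTA Herder's
    implementation: the function names, the gap character and the
    threshold values are assumptions chosen for the example, and the
    inputs are assumed to be the two rows of an already computed
    pairwise alignment.

```python
from typing import Tuple


def alignment_stats(aligned_a: str, aligned_b: str, gap: str = "-") -> Tuple[float, float, float]:
    """Return (percent_identity, coverage_a, coverage_b) for two aligned rows.

    percent_identity: identical residues / positions where both rows have a residue.
    coverage_x: fraction of sequence x's residues that are aligned to a residue.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have the same length")

    # Ungapped lengths of the two original sequences.
    len_a = sum(ch != gap for ch in aligned_a)
    len_b = sum(ch != gap for ch in aligned_b)

    aligned = 0    # columns with a residue in both rows
    identical = 0  # aligned columns where the residues match
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue
        aligned += 1
        if x == y:
            identical += 1

    percent_identity = 100.0 * identical / aligned if aligned else 0.0
    coverage_a = aligned / len_a if len_a else 0.0
    coverage_b = aligned / len_b if len_b else 0.0
    return percent_identity, coverage_a, coverage_b


def passes_filter(aligned_a: str, aligned_b: str,
                  min_coverage: float = 0.9,   # hypothetical "stringent" coverage cutoff
                  min_identity: float = 30.0   # hypothetical "weak" identity cutoff
                  ) -> bool:
    """Stringent on coverage of both sequences, lenient on percent identity."""
    pid, cov_a, cov_b = alignment_stats(aligned_a, aligned_b)
    return min(cov_a, cov_b) >= min_coverage and pid >= min_identity
```

    For the toy rows "ACDEFGHIKL" and "ACDQFGH-KL", 8 of the 9 aligned
    positions are identical (about 88.9 percent identity) and the
    coverages are 0.9 and 1.0, so the pair passes with the illustrative
    cutoffs above. Requiring the coverage of both sequences to be high
    is what makes the alignment near full length.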

    By its nature, FASTA Herder is limited to full length sequences of
    high quality. Both of these are limitations that for de novo discovery
    would have to be addressed, though these issues could properly be seen as
    topics for further research. Some protein database sequences are
    high-quality but partial, whereas others are full length but of
    questionable quality. Of particular difficulty for FASTA Herder could
    be retained introns. It is also not obvious to me, one way or the
    other, whether FASTA Herder is or could be useful at lengths shorter
    than the full protein, e.g. for classifying sequences at the domain
    level.

    For a common use case -- the case in which one has a preferred query,
    e.g. human, in mind -- presumably one could use the preferred protein
    as the representative of its group. The authors did not directly
    address this.
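
    The suggestion above could be sketched as follows. This is only an
    illustration of the idea, not something the paper or FASTA Herder is
    known to do: the Record type, the function name and the fallback
    rule (take the longest sequence when no preferred-species member
    exists) are all assumptions made for the example.

```python
from typing import NamedTuple, Sequence


class Record(NamedTuple):
    """Hypothetical minimal sequence record for this sketch."""
    identifier: str
    species: str
    sequence: str


def pick_representative(cluster: Sequence[Record], preferred_species: str) -> Record:
    """Pick one record to represent a cluster of redundant homologs.

    Members from the preferred species win; otherwise fall back to the
    whole cluster. Ties are broken by sequence length (an assumption).
    """
    if not cluster:
        raise ValueError("cluster is empty")
    preferred = [r for r in cluster if r.species == preferred_species]
    candidates = preferred or list(cluster)
    return max(candidates, key=lambda r: len(r.sequence))
```

    With a cluster containing mouse, human and zebrafish members and
    "Homo sapiens" as the preferred species, the human record is kept
    even when it is not the longest sequence in the group.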

    The authors test against OrthoBench, which has "1677 proteins from 12
    Metazoa species grouped into 70 protein families". Performing such a
    test demonstrates the plausibility of their approach. And the use of
    full length alignments is quite interesting. It is fast, conceptually
    simple (to align in such a way, the domain structure would have to
    line up) and would appear to have low risk of overfitting. It is
    intriguing that they do well at low percent identity. The limitation
    is that this is a small-scale test, using well-curated sequences. The
    test does not directly address the stated problem -- the filtering of
    search results from searches of large databases. The authors state
    "the number of mistakes could be expected to be smaller than in
    real-life applications". So, the results are suggestive and
    encouraging, but not comprehensive.

    I have rated "level of comprehensibility" a bit low for a simple
    reason. The paper has several supplementary tables, but no main
    figure or table at all, much less one that summarizes the main results
    in a comprehensible way. This was, I think, a poor choice and if this
    situation occurred in a pre-reviewed rather than continuously reviewed
    paper, I would ask for a revision that addressed the issue. If the
    journal pushed for all figures and tables to be in supplementary, then
    the journal has made a poor choice. Either way, the supplementary
    tables do not provide a well-presented brief summary of the results.
