
    Review of 'FASTA Herder: a web application to trim protein sequence sets'

    Average rating:
        Rated 3 of 5.
    Level of importance:
        Rated 4 of 5.
    Level of validity:
        Rated 3 of 5.
    Level of completeness:
        Rated 3 of 5.
    Level of comprehensibility:
        Rated 2 of 5.
    Competing interests:
    None

    Reviewed article


    FASTA Herder: a web application to trim protein sequence sets

    (2014)
    The ever increasing number of sequences in protein databases usually turns out large numbers of homologs in sequence similarity searches. While information from homology can be very useful for functional prediction based on amino acid conservation, many of these homologs usually have high levels of identity among themselves, which hinders multiple sequence alignment (MSA) computation and, especially, visualization. More generally, high redundancy reduces the usability of a protein set in machine learning applications and biases statistical analyses. We developed an algorithm to identify redundant sequence homologs that can be culled producing a streamlined FASTA file. As a difference from other automatic approaches that only aggregate sequences with high identity, our method clusters near-full length homologs allowing for lower sequence identity thresholds. Our method was fully tested and implemented in a web application called FASTA Herder, publicly available at http://www.ogic.ca/projects/fh/orain.html .

      Review information


      Review text

      The authors of this article have identified a real and growing
      difficulty in identifying and comparing homologous sequences -- the
      number of sequences is growing so rapidly that trimming
      the set of results is necessary to make appropriate use of the data.
      The paper proposes that full-length alignments with stringent
      requirements on pairwise alignment coverage, but a weak
      percent-identity threshold (the ratio of aligned amino acids that are
      identical to amino acid positions that participate in the alignment),
      are useful for identifying orthologous sequences.
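
      To make the two quantities concrete, here is a minimal sketch of
      how percent identity and alignment coverage could be computed from
      a gapped pairwise alignment. It is not the authors' implementation:
      the function names, the gap character "-", and the threshold
      values (0.4 identity, 0.9 coverage) are assumptions chosen only
      for illustration.

```python
def alignment_stats(aligned_a: str, aligned_b: str) -> tuple[float, float, float]:
    """Return (identity, coverage_a, coverage_b) for two gapped, equal-length rows.

    identity   = identical residues / columns where both rows have a residue
    coverage_x = columns where both rows have a residue / ungapped length of x
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have the same length")

    aligned_cols = 0  # columns with a residue in both sequences
    identical = 0     # aligned columns where the residues match
    for x, y in zip(aligned_a, aligned_b):
        if x != "-" and y != "-":
            aligned_cols += 1
            if x == y:
                identical += 1

    len_a = len(aligned_a.replace("-", ""))
    len_b = len(aligned_b.replace("-", ""))
    identity = identical / aligned_cols if aligned_cols else 0.0
    coverage_a = aligned_cols / len_a if len_a else 0.0
    coverage_b = aligned_cols / len_b if len_b else 0.0
    return identity, coverage_a, coverage_b


def is_redundant_pair(aligned_a: str, aligned_b: str,
                      min_identity: float = 0.4,   # hypothetical weak identity cutoff
                      min_coverage: float = 0.9    # hypothetical strict coverage cutoff
                      ) -> bool:
    """Strict on coverage (both sequences), lenient on identity."""
    identity, cov_a, cov_b = alignment_stats(aligned_a, aligned_b)
    return identity >= min_identity and min(cov_a, cov_b) >= min_coverage
```

      The point of requiring the minimum of the two coverages is that a
      short fragment aligned well to part of a long protein fails the
      test even when its identity is high.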

      By its nature, FASTA Herder is limited to full length sequences of
      high quality. Both of these are limitations that would have to be
      addressed for de novo discovery, though these issues could properly be seen as
      topics for further research. Some protein database sequences are
      high-quality but partial, whereas others are full length but of
      questionable quality. Of particular difficulty to FASTA Herder could
      be retained introns. It is also not obvious to me, one way or the
      other, whether FASTA Herder is or could be useful at lengths shorter
      than the full protein, e.g. at classifying sequences at the domain
      level.

      For a common use case -- the case in which one has a preferred query,
      e.g. human, in mind -- presumably one could use the preferred protein
      as the representative of its group. The authors did not directly
      address this.
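
      A sketch of what I have in mind follows. It assumes clusters are
      already available as lists of (identifier, sequence) pairs; the
      function name, the identifiers, and the fallback to the longest
      member are hypothetical and are not taken from the paper.

```python
def pick_representative(cluster: list[tuple[str, str]],
                        preferred_ids: set[str]) -> str:
    """Return the identifier that should represent a cluster.

    If any member is in preferred_ids (e.g. the user's human query),
    the first such member is used. Otherwise fall back to the longest
    sequence, which is an assumed default, not the published rule.
    """
    if not cluster:
        raise ValueError("cluster must contain at least one sequence")

    for seq_id, _sequence in cluster:
        if seq_id in preferred_ids:
            return seq_id

    # No preferred member present: choose the longest sequence.
    return max(cluster, key=lambda item: len(item[1]))[0]
```

      For example, with a cluster holding a mouse and a human protein
      and the human identifier marked as preferred, the human entry is
      kept even if the mouse sequence is longer.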

      The authors test against OrthoBench, which has "1677 proteins from 12
      Metazoa species grouped into 70 protein families". Performing such a
      test demonstrates the plausibility of their approach. And the use of
      full length alignments is quite interesting. It is fast, conceptually
      simple (to align in such a way, the domain structure would have to
      line up) and would appear to have low risk of overfitting. It is
      intriguing that they do well at low percent identity. The limitation
      is that this is a small-scale test, using well-curated sequences. The
      test does not directly address the stated problem -- the filtering of
      search results from searches of large databases. The authors state
      "the number of mistakes could be expected to be smaller than in
      real-life applications". So, the results are suggestive and
      encouraging, but not comprehensive.

      I have rated "level of comprehensibility" a bit low for a simple
      reason. The paper has several supplementary tables, but no main
      figure or table at all, much less one that summarizes the main results
      in a comprehensible way. This was, I think, a poor choice and if this
      situation occurred in a pre-reviewed rather than continuously reviewed
      paper, I would ask for a revision that addressed the issue. If the
      journal pushed for all figures and tables to be in supplementary, then
      the journal has made a poor choice. Either way, the supplementary
      tables do not provide a well-presented brief summary of the results.
