    Is Open Access

    Review of 'FASTA Herder: a web application to trim protein sequence sets'

    Average rating:
        Rated 3.5 of 5.
    Level of importance:
        Rated 3 of 5.
    Level of validity:
        Rated 4 of 5.
    Level of completeness:
        Rated 3 of 5.
    Level of comprehensibility:
        Rated 4 of 5.
    Competing interests:

    Reviewed article

    Is Open Access

    FASTA Herder: a web application to trim protein sequence sets

    The ever increasing number of sequences in protein databases usually turns out large numbers of homologs in sequence similarity searches. While information from homology can be very useful for functional prediction based on amino acid conservation, many of these homologs usually have high levels of identity among themselves, which hinders multiple sequence alignment (MSA) computation and, especially, visualization. More generally, high redundancy reduces the usability of a protein set in machine learning applications and biases statistical analyses. We developed an algorithm to identify redundant sequence homologs that can be culled producing a streamlined FASTA file. As a difference from other automatic approaches that only aggregate sequences with high identity, our method clusters near-full length homologs allowing for lower sequence identity thresholds. Our method was fully tested and implemented in a web application called FASTA Herder, publicly available at http://www.ogic.ca/projects/fh/orain.html .

      Review information

      Review text

      The authors present FASTA Herder, a web application for culling protein datasets. The implemented method, which was previously developed by the authors (reference 7), efficiently explores the protein space and identifies redundant sequences that are likely to share a similar domain structure.

      FASTA Herder uses an incremental, greedy strategy common to every clustering algorithm [A] and calculates pairwise sequence alignments with BLAST. The originality of FASTA Herder lies in the parameters used to validate a match between two sequences: a minimum identity threshold and a maximum length difference. The default thresholds for both parameters are optimized according to the length of the compared sequences and aim to identify distant homologous sequences that have similar domain composition and are thus likely to perform similar functions. The web implementation includes an additional option for filtering low-complexity regions that can further optimize the clustering process.
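
      The two match criteria described above can be illustrated with a minimal sketch of greedy incremental clustering. This is not the authors' implementation: the function names, the fixed thresholds (`min_identity`, `max_length_diff`), and the gap-free `naive_identity` stand-in for a BLAST alignment are all assumptions made for illustration, and the length-dependent defaults used by FASTA Herder are not reproduced.

```python
from typing import Callable, Dict, List


def naive_identity(a: str, b: str) -> float:
    """Toy stand-in for a BLAST alignment (assumption): fraction of
    identical positions over the shorter sequence, compared without gaps."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / n


def greedy_cluster(
    seqs: Dict[str, str],
    min_identity: float = 0.4,      # hypothetical threshold
    max_length_diff: float = 0.1,   # hypothetical threshold
    identity: Callable[[str, str], float] = naive_identity,
) -> Dict[str, List[str]]:
    """Assign each sequence to the first representative that satisfies
    both criteria; otherwise make it a new representative."""
    # Process longest sequences first so representatives are the longest members.
    order = sorted(seqs, key=lambda name: len(seqs[name]), reverse=True)
    clusters: Dict[str, List[str]] = {}
    for name in order:
        seq = seqs[name]
        for rep in clusters:
            rep_seq = seqs[rep]
            # Relative length difference between representative and candidate.
            longest = max(len(rep_seq), len(seq), 1)
            length_diff = abs(len(rep_seq) - len(seq)) / longest
            if length_diff <= max_length_diff and identity(rep_seq, seq) >= min_identity:
                clusters[rep].append(name)
                break
        else:
            # No representative matched: start a new cluster.
            clusters[name] = [name]
    return clusters


example = {
    "a": "MKTAYIAKQR",
    "b": "MKTAYIAKQK",
    "c": "MKTAY",
    "d": "GGGGGGGGGG",
}
clusters = greedy_cluster(example)
```

      In this toy input, "c" shares its whole length with "a" but is kept apart by the length criterion, which is the behaviour that distinguishes a near-full-length match from a purely identity-based one.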

      The authors benchmark FASTA Herder against a related tool, PISCES, and show that their method is faster and performs better in terms of the number of sequences correctly assigned to their protein family, according to the OrthoBench dataset. It would be interesting to extend this comparison to other widely used algorithms such as CD-HIT [B] and kClust [C]. Both are based on a different rationale: they use a short-word pre-filtering step to avoid insignificant sequence comparisons and use the Smith-Waterman alignment method. This extended benchmark would clarify the relevance of the alignment length parameter in the correct identification of homologous relationships.
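
      The idea of a short-word pre-filter can be sketched as follows. This is a simplified illustration of the general technique, not the actual logic of CD-HIT or kClust; the word length `k`, the cutoff `min_shared`, and both function names are hypothetical.

```python
from typing import Set


def kmers(seq: str, k: int = 3) -> Set[str]:
    """Return the set of overlapping words of length k in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def passes_word_filter(a: str, b: str, k: int = 3, min_shared: int = 2) -> bool:
    """Cheap screen run before any alignment: only pairs sharing at least
    min_shared distinct words are passed on to the costly comparison."""
    return len(kmers(a, k) & kmers(b, k)) >= min_shared


similar = passes_word_filter("MKTAYIAKQR", "MKTAYIAKQK")
unrelated = passes_word_filter("MKTAYIAKQR", "GGGGGGGGGG")
```

      Pairs rejected by such a screen are never aligned, which is where most of the speed of word-based tools comes from.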

      The authors limit the scope of their method to the reduction of redundancy within a protein dataset and distinguish it from an orthology prediction approach. However, one of the criteria for the comparison with PISCES is the correct classification within orthologous groups (as defined in OrthoBench). It is not clear whether the sequences "misclassified" by PISCES, which partly explain its higher compression rate, mostly correspond to distant paralogous sequences or to analogous sequences with no ancestral relationship to the other sequences within the cluster. A more detailed description of these misclassified sequences in terms of identity and length might shed light on this issue.

      One last point: expressions like "low levels of homology" should be avoided. Homology cannot be low or high because it refers to the existence of an ancestral relationship between two sequences. Sequence identity should be used instead.

      [A] Hobohm U, Scharf M, Schneider R, Sander C. Selection of representative protein data sets. Protein Sci. 1992 Mar;1(3):409-17. PubMed PMID: 1304348; PubMed Central PMCID: PMC2142204.

      [B] Huang Y, Niu B, Gao Y, Fu L, Li W. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics. 2010 Mar 1;26(5):680-2. doi:10.1093/bioinformatics/btq003. Epub 2010 Jan 6. PubMed PMID: 20053844; PubMed Central PMCID: PMC2828112.

      [C] Hauser M, Mayer CE, Söding J. kClust: fast and sensitive clustering of large protein sequence databases. BMC Bioinformatics. 2013 Aug 15;14:248. doi:10.1186/1471-2105-14-248. PubMed PMID: 23945046; PubMed Central PMCID: PMC3843501.

