The authors present FASTA Herder, a web application for culling protein datasets. The implemented method, which was previously developed by the authors (reference 7), efficiently explores the protein space and identifies redundant sequences that are likely to share a similar domain structure.
FASTA Herder uses an incremental, greedy strategy common to many sequence clustering algorithms [A] and calculates pairwise sequence alignments with BLAST. The originality of FASTA Herder lies in the parameters used to validate a match between two sequences: a minimum identity threshold and a maximum length difference. The default thresholds for both parameters are optimized according to the length of the compared sequences and aim at the identification of distant homologous sequences that have similar domain composition and are thus likely to perform similar functions. The web implementation includes an additional option for filtering low-complexity regions that can further optimize the clustering process.
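To make the matching criterion concrete, the greedy strategy described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the fixed thresholds (`min_identity`, `max_len_diff`) and the longest-first ordering are hypothetical, and a crude `difflib`-based score stands in for the BLAST alignment. FASTA Herder's actual defaults vary with sequence length and are not reproduced here.

```python
from difflib import SequenceMatcher


def identity(a: str, b: str) -> float:
    """Crude stand-in for a BLAST-derived identity (assumption, not BLAST):
    number of matched characters divided by the length of the shorter sequence."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def greedy_cluster(seqs, min_identity=0.5, max_len_diff=0.2):
    """Incremental greedy clustering: each sequence joins the first
    representative it matches, otherwise it founds a new cluster.
    Both thresholds are hypothetical fixed values chosen for illustration."""
    # Process the longest sequences first so that they become representatives.
    order = sorted(seqs, key=lambda name: len(seqs[name]), reverse=True)
    clusters = {}  # representative name -> list of member names
    for name in order:
        seq = seqs[name]
        for rep in clusters:
            rep_seq = seqs[rep]
            # Relative length difference between candidate and representative.
            len_diff = abs(len(rep_seq) - len(seq)) / max(len(rep_seq), len(seq))
            if len_diff <= max_len_diff and identity(rep_seq, seq) >= min_identity:
                clusters[rep].append(name)
                break
        else:
            # No representative matched: the sequence starts its own cluster.
            clusters[name] = [name]
    return clusters


seqs = {
    "a": "MKTAYIAKQRQISFVKSHFSRQ",
    "b": "MKTAYIAKQRQISFVKSHFSRA",  # near-identical to "a", same length
    "c": "MKTAYIAK",                # identical to the start of "a", much shorter
    "d": "GGGGPPPPWWWWCCCCHHHHDD",  # unrelated
}
clusters = greedy_cluster(seqs)
```

In this toy input, "c" is fully identical to the start of "a" but is kept in its own cluster because of the length-difference criterion, which is the behaviour that distinguishes this kind of match validation from an identity-only cutoff.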
The authors benchmark FASTA Herder against a related tool, PISCES, and show that their method is faster and performs better in terms of the number of sequences correctly assigned to their protein family, according to the OrthoBench dataset. It would be interesting to extend this comparison to other widely used algorithms such as CD-HIT [B] and kClust [C]. Both are based on a different rationale: they have a short-word pre-filtering step to avoid insignificant sequence comparisons and use the Smith-Waterman alignment method. This extended benchmark would clarify the relevance of the alignment length parameter in the correct identification of homologous relationships.
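For readers unfamiliar with the short-word pre-filtering idea mentioned above, a minimal sketch is given here. It is not the code of CD-HIT or kClust: the function names, the word length `k` and the `min_shared` cutoff are hypothetical, and both tools apply more refined criteria. The point is only that a cheap count of shared words can rule out a pair before any alignment is computed.

```python
def kmers(seq: str, k: int = 3) -> set:
    """Return the set of overlapping words of length k found in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def passes_prefilter(a: str, b: str, k: int = 3, min_shared: int = 2) -> bool:
    """Keep a pair for alignment only if it shares at least min_shared words.
    Both k and min_shared are illustrative values, not tool defaults."""
    return len(kmers(a, k) & kmers(b, k)) >= min_shared


# A near-identical pair is kept for alignment; an unrelated pair is discarded early.
similar_pair = passes_prefilter("MKTAYIAKQR", "MKTAYIAKQA")
unrelated_pair = passes_prefilter("MKTAYIAKQR", "GGGGPPPPWW")
```

Only pairs that pass such a filter would go on to the more expensive alignment step.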
The authors limit the scope of their method to the reduction of redundancy within a protein dataset and distinguish it from an orthology prediction approach. However, one of the criteria for the comparison with PISCES is the correct classification within orthologous groups (as defined in OrthoBench). It is not clear whether the sequences "misclassified" by PISCES, which partly explain its higher compression rate, mostly correspond to distant paralogous sequences or to analogous sequences with no ancestral relationship to the other sequences within the cluster. A more detailed description of these misclassified sequences in terms of identity and length might shed light on this issue.
One last point: expressions like "low levels of homology" should be avoided. Homology cannot be low or high because it refers to the existence of an ancestral relationship between two sequences. Sequence identity should be used instead.
[A] Hobohm U, Scharf M, Schneider R, Sander C. Selection of representative protein data sets. Protein Sci. 1992 Mar;1(3):409-17. PubMed PMID: 1304348; PubMed Central PMCID: PMC2142204.
[B] Huang Y, Niu B, Gao Y, Fu L, Li W. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics. 2010 Mar 1;26(5):680-2. doi:10.1093/bioinformatics/btq003. Epub 2010 Jan 6. PubMed PMID: 20053844; PubMed Central PMCID: PMC2828112.
[C] Hauser M, Mayer CE, Söding J. kClust: fast and sensitive clustering of large protein sequence databases. BMC Bioinformatics. 2013 Aug 15;14:248. doi:10.1186/1471-2105-14-248. PubMed PMID: 23945046; PubMed Central PMCID: PMC3843501.