The authors of this article address a real and growing difficulty in
identifying and comparing homologous sequences -- the number of
sequences is growing so quickly that trimming the set of results is
necessary to make appropriate use of the data.
The paper proposes that full length alignments with stringent
requirements on pairwise alignment coverage, but a weak
percent-identity threshold (the ratio of aligned amino acids that are
identical to amino acid positions that participate in the alignment),
are useful for identifying orthologous sequences.
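To make the two criteria concrete, here is a minimal Python sketch of how percent identity and coverage might be computed from a gapped pairwise alignment. It is my own illustration, not the authors' implementation: the function name `alignment_stats` is hypothetical, and I assume that "positions that participate in the alignment" means columns in which both sequences have a residue, and that coverage is the fraction of each sequence's residues falling in such columns.

```python
def alignment_stats(aligned_a: str, aligned_b: str, gap: str = "-"):
    """Return (percent_identity, coverage_a, coverage_b) for two gapped,
    equal-length rows of a pairwise alignment.

    Assumption: only columns where both rows hold a residue count as
    participating in the alignment.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    # Columns in which neither sequence has a gap.
    paired = [(x, y) for x, y in zip(aligned_a, aligned_b)
              if x != gap and y != gap]
    identical = sum(1 for x, y in paired if x == y)

    # Ungapped lengths of the two original sequences.
    len_a = sum(1 for x in aligned_a if x != gap)
    len_b = sum(1 for y in aligned_b if y != gap)

    percent_identity = 100.0 * identical / len(paired) if paired else 0.0
    coverage_a = len(paired) / len_a if len_a else 0.0
    coverage_b = len(paired) / len_b if len_b else 0.0
    return percent_identity, coverage_a, coverage_b


pid, cov_a, cov_b = alignment_stats("MKT-LV", "MRTAL-")
print(pid, cov_a, cov_b)
```

Under this reading, a pair passes when both coverage values are high even if the percent identity is modest, which is the combination the paper argues for.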
By its nature, FASTA Herder is limited to full length sequences of
high quality. Both of these are limitations that would have to be
addressed for de novo discovery, though these issues could properly be
seen as topics for further research. Some protein database sequences are
high-quality but partial, whereas others are full length but of
questionable quality. Retained introns could pose a particular
difficulty for FASTA Herder. It is also not obvious to me, one way or
the other, whether FASTA Herder is or could be useful at lengths shorter
than the full protein, e.g. at classifying sequences at the domain
level.
For a common use case -- the case in which one has a preferred query,
e.g. human, in mind -- presumably one could use the preferred protein
as the representative of its group. The authors did not directly
address this.
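As an illustration of what I have in mind, the following short Python sketch picks the preferred-species protein as the representative of a group when one is present. Everything here is hypothetical: the function name `pick_representative`, the `(sequence_id, species, sequence)` tuple layout, and the fallback to the longest sequence are my assumptions, not something described by the authors.

```python
def pick_representative(group, preferred_species="Homo sapiens"):
    """Choose a representative from a group of homologous sequences.

    `group` is a list of (sequence_id, species, sequence) tuples
    (a hypothetical layout). A member from the preferred species wins;
    otherwise, or among several preferred members, the longest
    sequence is returned (an assumed tie-break).
    """
    if not group:
        raise ValueError("group must contain at least one sequence")

    preferred = [m for m in group if m[1] == preferred_species]
    candidates = preferred or group
    return max(candidates, key=lambda m: len(m[2]))


group = [
    ("seqA", "Mus musculus", "MKTLVAA"),
    ("seqB", "Homo sapiens", "MKTLV"),
    ("seqC", "Danio rerio", "MK"),
]
print(pick_representative(group)[0])
```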
The authors test against OrthoBench, which has "1677 proteins from 12
Metazoa species grouped into 70 protein families". Performing such a
test demonstrates the plausibility of their approach. And the use of
full length alignments is quite interesting. It is fast, conceptually
simple (to align in such a way, the domain structure would have to
line up) and would appear to have low risk of overfitting. It is
intriguing that they do well at low percent identity. The limitation
is that this is a small-scale test, using well-curated sequences. The
test does not directly address the stated problem -- the filtering of
search results from searches of large databases. The authors state
"the number of mistakes could be expected to be smaller than in
real-life applications". So, the results are suggestive and
encouraging, but not comprehensive.
I have rated "level of comprehensibility" a bit low for a simple
reason. The paper has several supplementary tables, but no main
figure or table at all, much less one that summarizes the main results
in a comprehensible way. This was, I think, a poor choice and if this
situation occurred in a pre-reviewed rather than continuously reviewed
paper, I would ask for a revision that addressed the issue. If the
journal pushed for all figures and tables to be in supplementary, then
the journal has made a poor choice. Either way, the supplementary
tables do not provide a well-presented brief summary of the results.