Because a protein's function is usually related to its subcellular localization,
the ability to predict subcellular localization directly from protein sequences is
useful for inferring protein function. Recent years have seen a surge of interest
in the development of novel computational tools to predict subcellular localization.
At present, these approaches, based on a wide range of algorithms, have achieved varying
degrees of success for specific organisms and for certain localization categories.
A number of authors have noticed that sequence similarity is useful in predicting
subcellular localization. For example, Nair and Rost (Protein Sci 2002;11:2836-2847)
have carried out extensive analysis of the relation between sequence similarity and
identity in subcellular localization, and have found a close relationship between
them above a certain similarity threshold. However, many existing benchmark data sets
used to assess prediction accuracy contain highly homologous sequences, with some
data sets comprising sequences of up to 80-90% sequence identity. Using these benchmark
test data will lead to overestimation of the performance of the methods considered.
Here, we develop an approach based on a two-level support vector machine (SVM) system:
the first level comprises a number of SVM classifiers, each based on a specific type
of feature vectors derived from sequences; the second level SVM classifier functions
as the jury machine to generate the probability distribution of decisions for possible
localizations. We compare our approach with a global sequence alignment approach and
other existing approaches for two benchmark data sets, one comprising prokaryotic sequences
and the other eukaryotic sequences. Furthermore, we carried out all-against-all sequence
alignment for several data sets to investigate the relationship between sequence homology
and subcellular localization. Our results, which are consistent with previous studies,
indicate that the homology search approach performs well down to 30% sequence identity,
although its performance deteriorates considerably for sequences sharing lower sequence
identity. A data set with a high level of homology will lead to a biased assessment
of the performance of predictive approaches, especially those relying on homology
search or sequence annotations. Our two-level classification system based on SVM does
not rely on homology search; therefore, its performance remains relatively unaffected
by sequence homology. When compared with other approaches, our approach performed
significantly better. Furthermore, we also develop a practical hybrid method, which
combines the two-level SVM classifier and the homology search method, as a general
tool for the sequence annotation of subcellular localization.
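
The hybrid idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the two components are combined, so the rule used here (transfer the annotation of the best global-alignment hit when identity reaches 30%, the level down to which homology search is reported to perform well, and otherwise defer to a sequence-based classifier) is an assumption. The function names `global_identity` and `hybrid_predict`, the scoring parameters, and the `fallback` callable standing in for the two-level SVM are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple


def global_identity(a: str, b: str, match: int = 1,
                    mismatch: int = -1, gap: int = -2) -> float:
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns identical columns divided by alignment length
    (scoring values are illustrative, not taken from the paper).
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back one optimal path, counting identical columns.
    i, j, identical, columns = n, m, 0, 0
    while i > 0 or j > 0:
        sub = match if (i > 0 and j > 0 and a[i - 1] == b[j - 1]) else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            identical += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
        columns += 1
    return identical / columns if columns else 0.0


def hybrid_predict(query: str,
                   annotated: List[Tuple[str, str]],
                   fallback: Callable[[str], str],
                   threshold: float = 0.30) -> Tuple[str, str]:
    """Return (localization, source) for a query sequence.

    annotated: (sequence, localization) pairs with known labels.
    fallback:  hypothetical stand-in for a sequence-based classifier.
    threshold: assumed switching point between the two methods.
    """
    best_identity, best_label = 0.0, None  # type: float, Optional[str]
    for seq, label in annotated:
        ident = global_identity(query, seq)
        if ident > best_identity:
            best_identity, best_label = ident, label
    if best_label is not None and best_identity >= threshold:
        return best_label, "homology"
    return fallback(query), "classifier"


# Toy data, invented for illustration only.
known = [("MKKLLAAA", "periplasm"), ("GGGGGGGG", "cytoplasm")]
print(hybrid_predict("MKKLLAAT", known, lambda s: "unknown"))
print(hybrid_predict("WWWWWWWW", known, lambda s: "unknown"))
```

In practice the fallback would be a trained classifier and the homology search would use a dedicated alignment tool; the quadratic-time alignment above is only meant to make the decision rule concrete.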