      Is Open Access

      Benchmarking of protein descriptor sets in proteochemometric modeling (part 2): modeling performance of 13 amino acid descriptor sets





          While a large body of work exists on comparing and benchmarking descriptors of molecular structures, a similar comparison of protein descriptor sets is lacking. Hence, in the current work a total of 13 amino acid descriptor sets have been benchmarked with respect to their ability to establish bioactivity models. The descriptor sets included in the study are Z-scales (3 variants), VHSE, T-scales, ST-scales, MS-WHIM, FASGAI, BLOSUM, and a novel protein descriptor set termed ProtFP (4 variants); in addition, we created and benchmarked three pairs of descriptor combinations. Prediction performance was evaluated in seven structure-activity benchmarks, which comprise Angiotensin Converting Enzyme (ACE) dipeptidic inhibitor data and three proteochemometric data sets, namely (1) GPCR ligands modeled against a GPCR panel, (2) enzyme inhibitors (NNRTIs) with associated bioactivities against a set of HIV enzyme mutants, and (3) enzyme inhibitors (PIs) with associated bioactivities on a large set of HIV enzyme mutants.
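As a rough illustration of what an amino acid descriptor set does in such a model, the sketch below turns a peptide sequence into a numeric feature vector by looking up each residue in a descriptor table and concatenating the results. The table `TOY_DESCRIPTORS` and the function `encode_sequence` are hypothetical names, and the numbers are placeholders, not the published Z-scale or ProtFP values.

```python
from typing import Dict, List

# Toy 3-component descriptor table. These numbers are illustrative
# placeholders, NOT values from any published descriptor set.
TOY_DESCRIPTORS: Dict[str, List[float]] = {
    "A": [0.1, -0.2, 0.3],
    "V": [-0.4, 0.5, -0.6],
    "W": [0.7, 0.8, -0.9],
}


def encode_sequence(seq: str, table: Dict[str, List[float]]) -> List[float]:
    """Concatenate per-residue descriptor vectors into one feature vector."""
    features: List[float] = []
    for residue in seq:
        if residue not in table:
            raise KeyError(f"no descriptor available for residue {residue!r}")
        features.extend(table[residue])
    return features


# A dipeptide described by 3 components per residue yields 6 features.
dipeptide_features = encode_sequence("VW", TOY_DESCRIPTORS)
print(dipeptide_features)
```

In a proteochemometric setting, a vector like this for the protein side would typically be paired with ligand descriptors before model fitting; that step is not shown here.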


          The amino acid descriptor sets compared here show similar performance (<0.1 log units difference in root mean square error (RMSE) and <0.1 difference in Matthews correlation coefficient (MCC)), while errors for individual proteins were in some cases found to be larger than those resulting from descriptor set differences (>0.3 log units RMSE difference and >0.7 difference in MCC). Combining different descriptor sets generally leads to better modeling performance than utilizing individual sets. The best performers were Z-scales (3) combined with ProtFP (Feature), or Z-scales (3) combined with an average Z-scale value for each target, while ProtFP (PCA8), ST-scales, and ProtFP (Feature) rank last.
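The two performance measures quoted above can be sketched as follows. This is a minimal illustration using the standard definitions of RMSE and MCC; the activity values and the threshold of 6.5 log units used to split actives from inactives are made-up assumptions, not settings taken from the study.

```python
import math
from typing import Sequence


def rmse(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean square error between observed and predicted activities."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mcc(observed: Sequence[bool], predicted: Sequence[bool]) -> float:
    """Matthews correlation coefficient for binary class labels."""
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o and p)
    tn = sum(1 for o, p in pairs if not o and not p)
    fp = sum(1 for o, p in pairs if not o and p)
    fn = sum(1 for o, p in pairs if o and not p)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Conventionally MCC is set to 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Hypothetical activities in log units and a hypothetical activity cutoff.
observed = [5.0, 6.5, 7.2, 8.1]
predicted = [5.4, 6.1, 7.0, 8.5]
THRESHOLD = 6.5  # assumption for illustration only

observed_active = [value >= THRESHOLD for value in observed]
predicted_active = [value >= THRESHOLD for value in predicted]

print(round(rmse(observed, predicted), 3))
print(round(mcc(observed_active, predicted_active), 3))
```

RMSE is reported in the units of the modeled activity (log units here), whereas MCC ranges from -1 to 1 and only applies once predictions are reduced to classes.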


          While amino acid descriptor sets capture different aspects of amino acids, their usefulness for bioactivity modeling is, on average, surprisingly similar. Still, combining sets that describe complementary information leads to a small but consistent improvement in modeling performance (average MCC 0.01 better, average RMSE 0.01 log units lower). Finally, performance differences exist between the targets compared, underlining that choosing an appropriate descriptor set is of fundamental importance for bioactivity modeling, from both the ligand and the protein side.
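One plausible way to combine descriptor sets is sketched below: per-residue vectors from two sets are concatenated, and a per-target average over the residues is appended as extra features. The abstract does not spell out how the combinations or averages were built, so this is an assumption; `SET_A`, `SET_B`, `combine_sets`, and `encode_with_average` are hypothetical names with placeholder values.

```python
from typing import Dict, List

Table = Dict[str, List[float]]

# Two toy descriptor sets with placeholder values (not published data).
SET_A: Table = {"A": [0.1, -0.2, 0.3], "V": [-0.4, 0.5, -0.6]}
SET_B: Table = {"A": [1.0, 0.0], "V": [0.0, 1.0]}


def combine_sets(first: Table, second: Table) -> Table:
    """Concatenate the vectors of two descriptor sets residue by residue."""
    shared = sorted(first.keys() & second.keys())
    return {aa: first[aa] + second[aa] for aa in shared}


def encode_with_average(seq: str, table: Table) -> List[float]:
    """Per-residue descriptors followed by their average over the sequence."""
    per_residue = [table[residue] for residue in seq]
    n_components = len(per_residue[0])
    average = [
        sum(vector[i] for vector in per_residue) / len(per_residue)
        for i in range(n_components)
    ]
    flat = [value for vector in per_residue for value in vector]
    return flat + average


combined = combine_sets(SET_A, SET_B)
print(combined["V"])                      # 3 + 2 = 5 components per residue
print(encode_with_average("AV", SET_B))   # 4 residue features + 2 averages
```

Concatenation keeps each set's components intact, so a model can weight them separately; the cost is a longer feature vector per residue.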



                Author and article information

                J Cheminform
                Journal of Cheminformatics
                BioMed Central
                24 September 2013; 5:42
                [1] Division of Medicinal Chemistry, Leiden / Amsterdam Center for Drug Research, Einsteinweg 55, Leiden 2333 CC, The Netherlands
                [2] Structural Biology and Chemistry Department, Unité de Bioinformatique Structurale, Institut Pasteur and CNRS URA 2185, 25-28, rue du Dr. Roux, Paris 75 724, France
                [3] Tibotec BVBA, Turnhoutseweg 30, Beerse 2340, Belgium
                [4] ChEMBL Group, European Molecular Biology Laboratory European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton CB10 1SD, United Kingdom
                [5] Department of Chemistry, Unilever Centre for Molecular Science Informatics, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom
                Copyright © 2013 van Westen et al.; licensee Chemistry Central Ltd.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

                Research Article

                gpcr, hiv, qsam, peptides, amino acid index, protein descriptors, polypharmacology

