      The Impact of Protein Structure and Sequence Similarity on the Accuracy of Machine-Learning Scoring Functions for Binding Affinity Prediction

      research-article


          Abstract

          It has recently been claimed that the outstanding performance of machine-learning scoring functions (SFs) is exclusively due to the presence of training complexes whose proteins are highly similar to those in the test set. Here, we revisit this question using 24 similarity-based training sets, a widely used test set, and four SFs. Three of these SFs employ machine learning instead of the classical linear regression approach of the fourth SF (X-Score, which has the best test set performance out of 16 classical SFs). We have found that random forest (RF)-based RF-Score-v3 outperforms X-Score even when 68% of the most similar proteins are removed from the training set. In addition, unlike X-Score, RF-Score-v3 is able to keep learning with an increasing training set size, becoming substantially more predictive than X-Score when the full 1105 complexes are used for training. These results show that machine-learning SFs owe a substantial part of their performance to training on complexes whose proteins are dissimilar to those in the test set, contrary to what has been previously concluded using the same data. Given that a growing amount of structural and interaction data will be available from academic and industrial sources, this performance gap between machine-learning SFs and classical SFs is expected to widen in the future.
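
The idea of similarity-based training sets can be illustrated with a minimal sketch. It assumes that each training complex already carries a single number giving its highest similarity to any test-set protein; the abstract does not state which similarity measures or cutoffs were used, so the function name `nested_training_sets`, the cutoff values, and the toy data below are all hypothetical and not taken from the article.

```python
from typing import Dict, List, Sequence


def nested_training_sets(
    max_similarity: Dict[str, float],
    cutoffs: Sequence[float],
) -> Dict[float, List[str]]:
    """For each cutoff, keep the training complexes whose most similar
    test-set protein has a similarity at or below that cutoff.

    `max_similarity` maps a training complex ID to its highest similarity
    to any test-set protein (hypothetical input, assumed precomputed).
    """
    subsets: Dict[float, List[str]] = {}
    for cutoff in sorted(cutoffs):
        subsets[cutoff] = sorted(
            cid for cid, sim in max_similarity.items() if sim <= cutoff
        )
    return subsets


# Toy, made-up similarity values for five training complexes.
toy_similarity = {"c1": 0.15, "c2": 0.42, "c3": 0.77, "c4": 0.98, "c5": 1.00}

# Illustrative cutoffs only; the article's own cutoffs are not given here.
subsets = nested_training_sets(toy_similarity, cutoffs=[0.3, 0.6, 0.9, 1.0])

for cutoff, ids in subsets.items():
    removed = 1 - len(ids) / len(toy_similarity)
    print(f"cutoff={cutoff:.1f}  kept={ids}  removed={removed:.0%}")
```

Because each subset contains every complex from the subsets at lower cutoffs, training one scoring function per subset and scoring the same test set each time would show how performance changes as progressively more similar proteins are admitted.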

          Most cited references (24)


          A machine learning approach to predicting protein-ligand binding affinity with applications to molecular docking.

          Accurately predicting the binding affinities of large sets of diverse protein-ligand complexes is an extremely challenging task. The scoring functions that attempt such computational prediction are essential for analysing the outputs of molecular docking, which in turn is an important technique for drug discovery, chemical biology and structural biology. Each scoring function assumes a predetermined theory-inspired functional form for the relationship between the variables that characterize the complex (which also include parameters fitted to experimental or simulation data) and its predicted binding affinity. The inherent problem of this rigid approach is that it leads to poor predictivity for those complexes that do not conform to the modelling assumptions. Moreover, resampling strategies, such as cross-validation or bootstrapping, are still not systematically used to guard against the overfitting of calibration data in parameter estimation for scoring functions. We propose a novel scoring function (RF-Score) that circumvents the need for problematic modelling assumptions via non-parametric machine learning. In particular, Random Forest was used to implicitly capture binding effects that are hard to model explicitly. RF-Score is compared with the state of the art on the demanding PDBbind benchmark. Results show that RF-Score is a very competitive scoring function. Importantly, RF-Score's performance was shown to improve dramatically with training set size and hence the future availability of more high-quality structural and interaction data is expected to lead to improved versions of RF-Score. Contact: pedro.ballester@ebi.ac.uk; jbom@st-andrews.ac.uk. Supplementary data are available at Bioinformatics online.

            Comparative assessment of scoring functions on a diverse test set.

            Scoring functions are widely applied to the evaluation of protein-ligand binding in structure-based drug design. We have conducted a comparative assessment of 16 popular scoring functions implemented in mainstream commercial software or released by academic research groups. A set of 195 diverse protein-ligand complexes with high-resolution crystal structures and reliable binding constants were selected through a systematic nonredundant sampling of the PDBbind database and used as the primary test set in our study. All scoring functions were evaluated in three aspects, that is, "docking power", "ranking power", and "scoring power", and all evaluations were independent from the context of molecular docking or virtual screening. As for "docking power", six scoring functions, including GOLD::ASP, DS::PLP1, DrugScore(PDB), GlideScore-SP, DS::LigScore, and GOLD::ChemScore, achieved success rates over 70% when the acceptance cutoff was root-mean-square deviation < 2.0 Å. Combining these scoring functions into consensus scoring schemes improved the success rates to 80% or even higher. As for "ranking power" and "scoring power", the top four scoring functions on the primary test set were X-Score, DrugScore(CSD), DS::PLP, and SYBYL::ChemScore. They were able to correctly rank the protein-ligand complexes containing the same type of protein with success rates around 50%. Correlation coefficients between the experimental binding constants and the binding scores computed by these scoring functions ranged from 0.545 to 0.644. Besides the primary test set, each scoring function was also tested on four additional test sets, each consisting of a certain number of protein-ligand complexes containing one particular type of protein. Our study serves as an updated benchmark for evaluating the general performance of today's scoring functions. Our results indicate that no single scoring function consistently outperforms others in all three aspects. Thus, it is important in practice to choose the appropriate scoring functions for different purposes.

              NNScore 2.0: A Neural-Network Receptor–Ligand Scoring Function

              NNScore is a neural-network-based scoring function designed to aid the computational identification of small-molecule ligands. While the test cases included in the original NNScore article demonstrated the utility of the program, the application examples were limited. The purpose of the current work is to further confirm that neural-network scoring functions are effective, even when compared to the scoring functions of state-of-the-art docking programs, such as AutoDock, the most commonly cited program, and AutoDock Vina, thought to be two orders of magnitude faster. Aside from providing additional validation of the original NNScore function, we here present a second neural-network scoring function, NNScore 2.0. NNScore 2.0 considers many more binding characteristics when predicting affinity than does the original NNScore. The network output of NNScore 2.0 also differs from that of NNScore 1.0; rather than a binary classification of ligand potency, NNScore 2.0 provides a single estimate of the pKd. To facilitate use, NNScore 2.0 has been implemented as an open-source Python script. A copy can be obtained from http://www.nbcr.net/software/nnscore/.

                Author and article information

                Journal: Biomolecules
                Publisher: MDPI
                ISSN: 2218-273X
                Published: 14 March 2018
                Volume 8, Issue 1, Article 12
                Affiliations
                [1] SDIVF R&D Centre, Hong Kong Science Park, Sha Tin, New Territories, Hong Kong, China; jackyleehongjian@gmail.com
                [2] Institute of Future Cities, The Chinese University of Hong Kong, Sha Tin, New Territories, Hong Kong, China; andrew.pengjj@gmail.com (J.P.); yeeleung@cuhk.edu.hk (Y.L.); ksleung@cse.cuhk.edu.hk (K.-S.L.)
                [3] Department of Computer Science and Engineering, The Chinese University of Hong Kong, Sha Tin, New Territories, Hong Kong, China; mhwong@cse.cuhk.edu.hk
                [4] School of Mathematics and Statistics, Xi’an Jiaotong University, Xi’an 710049, China
                [5] School of Biomedical Sciences, The Chinese University of Hong Kong, Sha Tin, New Territories, Hong Kong, China; lugang@cuhk.edu.hk
                [6] Cancer Research Center of Marseille, INSERM U1068, F-13009 Marseille, France
                [7] Institut Paoli-Calmettes, F-13009 Marseille, France
                [8] Aix-Marseille Université, F-13284 Marseille, France
                [9] CNRS UMR7258, F-13009 Marseille, France
                Author information
                ORCID: https://orcid.org/0000-0001-8467-638X
                Article
                Publisher ID: biomolecules-08-00012
                DOI: 10.3390/biom8010012
                PMCID: PMC5871981
                PMID: 29538331
                Record ID: b6b9210c-3d31-4421-a00f-294e1523ac12
                © 2018 by the authors.

                Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license ( http://creativecommons.org/licenses/by/4.0/).

                History
                Received: 08 February 2018
                Accepted: 12 March 2018
                Categories
                Article

                Keywords: machine learning, scoring function, molecular docking, binding affinity prediction
