      Is Open Access

      Basic validation procedures for regression models in QSAR and QSPR studies: theory and application

      research-article


          Abstract

          Four quantitative structure-activity relationship (QSAR) and quantitative structure-property relationship (QSPR) data sets were selected from the literature and used to build regression models with 75, 56, 50 and 15 training samples. The models were validated by leave-one-out crossvalidation, leave-N-out crossvalidation (LNO), external validation, y-randomization and bootstrapping. The validations showed that the size of the training set is the crucial factor in determining model performance, which deteriorates as the data set becomes smaller. Models built from very small data sets cannot be thoroughly validated, and they fail or behave atypically in certain validations (chance correlation, lack of robustness to resampling and LNO), regardless of their good performance in leave-one-out crossvalidation, fitting and even external validation. A simple determination of the critical N in LNO has been introduced by using the limit of 0.1 for oscillations in Q², quantified as the variation range in single LNO and two standard deviations in multiple LNO. It has been demonstrated that 10-25 y-randomization and bootstrap runs are sufficient for a typical model validation. Bootstrap schemes based on hierarchical cluster analysis give more reliable and reasonable results than bootstraps relying only on randomization of the complete data set. Data quality, in terms of the statistical significance of descriptor-y relationships, is the second most important factor for model performance. Variable selection that does not eliminate insignificant descriptor-y relationships may lead to situations in which they are not detected during model validation, especially when dealing with large data sets.
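The LNO criterion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single-descriptor least-squares model, consecutive blocks of N left-out samples, and synthetic data, and it reads the "critical N" as the largest N for which the range of Q² values obtained so far stays within the 0.1 limit. The function names `fit_line`, `q2_lno` and `critical_n` are hypothetical.

```python
import random

def fit_line(x, y):
    """Least-squares line for one descriptor: returns (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def q2_lno(x, y, n_out):
    """Q^2 = 1 - PRESS / SS_tot, leaving out consecutive blocks of n_out samples.

    Assumption: blocks are taken in the stored order of the samples.
    """
    my = sum(y) / len(y)
    ss_tot = sum((yi - my) ** 2 for yi in y)
    press = 0.0
    for start in range(0, len(x), n_out):
        held = range(start, min(start + n_out, len(x)))
        x_tr = [x[i] for i in range(len(x)) if i not in held]
        y_tr = [y[i] for i in range(len(y)) if i not in held]
        slope, intercept = fit_line(x_tr, y_tr)
        # Accumulate squared prediction errors for the left-out block only.
        press += sum((y[i] - (slope * x[i] + intercept)) ** 2 for i in held)
    return 1.0 - press / ss_tot

def critical_n(x, y, max_n, limit=0.1):
    """Largest N (1..max_n) for which max(Q^2) - min(Q^2) over 1..N is <= limit."""
    q2 = [q2_lno(x, y, n) for n in range(1, max_n + 1)]
    crit = 0
    for n in range(1, max_n + 1):
        if max(q2[:n]) - min(q2[:n]) <= limit:
            crit = n
        else:
            break
    return crit, q2

# Synthetic example data (an assumption, not taken from the article).
random.seed(0)
x = [i / 10 for i in range(30)]
y = [2.0 * xi + 1.0 + random.gauss(0.0, 0.2) for xi in x]
crit, q2_values = critical_n(x, y, max_n=10)
print(crit, [round(q, 3) for q in q2_values])
```

With N = 1 the function `q2_lno` reduces to leave-one-out crossvalidation. For multiple LNO, the abstract's criterion would instead compare two standard deviations of Q² over repeated random partitions against the same 0.1 limit.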

          Translated abstract

          Four QSAR and QSPR data sets were selected from the literature, and regression models were built with 75, 56, 50 and 15 samples in the training set. These models were validated by leave-one-out crossvalidation, leave-N-out crossvalidation (LNO), external validation, y-randomization and bootstrap validation. The validation results showed that the size of the training set is the main factor for good model performance, since performance worsens for small data sets. Models derived from very small data sets cannot be tested to their full extent. Moreover, they may fail and show atypical behavior in some of the validation tests (for example, chance correlations and lack of robustness in resampling and in crossvalidation), even after performing well in leave-one-out crossvalidation, in fitting and even in external validation. A simple way of determining the critical value of N in LNO was introduced, using the limit of 0.1 for oscillations in Q² (range of variation in single LNO and two standard deviations in multiple LNO). It was shown that 10-25 cycles of y-randomization or bootstrapping are sufficient for a typical validation. The bootstrap method based on hierarchical cluster analysis gives more reliable and reasonable results than those based only on randomization of the complete data set. Data quality, in terms of the statistical significance of the descriptor-y relationships, is the second most important factor for model performance. Variable selection in which insignificant relationships were not eliminated may lead to situations in which they are not detected during model validation, especially when the data set is large.
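The y-randomization test named in the abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the article's procedure: it uses a single-descriptor model (for which R² equals the squared Pearson correlation), 25 runs (the upper end of the 10-25 range the abstract reports as sufficient), and synthetic data. The names `r2` and `y_randomization` are hypothetical.

```python
import random

def r2(x, y):
    """R^2 of a one-descriptor least-squares line (squared Pearson correlation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy * sxy / (sxx * syy)

def y_randomization(x, y, runs=25, seed=0):
    """Return R^2 of the real model and R^2 values of models fitted to shuffled y."""
    rng = random.Random(seed)
    real = r2(x, y)
    scrambled = []
    for _ in range(runs):
        y_perm = list(y)
        rng.shuffle(y_perm)  # break the descriptor-y relationship
        scrambled.append(r2(x, y_perm))
    return real, scrambled

# Synthetic example data (an assumption, not taken from the article).
noise = random.Random(1)
x = [i / 10 for i in range(30)]
y = [2.0 * xi + 1.0 + noise.gauss(0.0, 0.2) for xi in x]
real_r2, scrambled_r2 = y_randomization(x, y)
print(round(real_r2, 3), round(max(scrambled_r2), 3))
```

A model that is not a chance correlation should score clearly higher than every scrambled model. In practice the same comparison is usually made for Q² as well as R².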


                Author and article information

                Journal
                jbchs
                Journal of the Brazilian Chemical Society
                J. Braz. Chem. Soc.
                Sociedade Brasileira de Química (São Paulo, SP, Brazil)
                ISSN (print): 0103-5053
                ISSN (electronic): 1678-4790
                Year: 2009
                Volume: 20
                Issue: 4
                Pages: 770-787
                Affiliations
                [01] Laboratory for Theoretical and Applied Chemometrics, Institute of Chemistry, State University of Campinas, Campinas, SP, Brazil
                Article
                S0103-50532009000400021 S0103-5053(09)02000421
                10.1590/S0103-50532009000400021
                83e3d675-2db6-4abc-9e51-c2a3d758aeeb

                This work is licensed under a Creative Commons Attribution 4.0 International License.

                History
                : 06 May 2009
                : 24 November 2008
                Page count
                Figures: 0, Tables: 0, Equations: 0, References: 63, Pages: 18
                Product

                SciELO Brazil

                Categories
                Articles

                Keywords: external validation, leave-one-out crossvalidation, leave-N-out crossvalidation, y-randomization, bootstrapping
