      Is Open Access

      Reliable estimation of prediction errors for QSAR models under model uncertainty using double cross-validation

      research-article


          Abstract

          Background

          Generally, QSAR modelling requires both model selection and validation since there is no a priori knowledge about the optimal QSAR model. Prediction errors (PE) are frequently used to select and to assess the models under study. Reliable estimation of prediction errors is challenging, especially under model uncertainty, and requires independent test objects. These test objects must not be involved in either model building or model selection. Double cross-validation, sometimes also termed nested cross-validation, offers an attractive way to generate test data and to select QSAR models since it uses the data very efficiently. Nevertheless, there is controversy in the literature about the reliability of double cross-validation under model uncertainty. Moreover, systematic studies investigating the adequate parameterization of double cross-validation are still missing. Here, the cross-validation design in the inner loop and the influence of the test set size in the outer loop are systematically studied for regression models in combination with variable selection.
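The separation between model selection (inner loop) and model assessment (outer loop) described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the toy data, the helper names (`k_folds`, `fit_univariate`, `mse`, `double_cv`), the fold counts, and the deliberately simple "variable selection" that picks one descriptor for a univariate least-squares model.

```python
import random
from statistics import mean

def k_folds(indices, k):
    """Split a list of indices into k roughly equal folds."""
    return [indices[i::k] for i in range(k)]

def fit_univariate(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx else 0.0
    return my - b * mx, b

def mse(X, y, train, test, var):
    """Fit on `train` using descriptor `var`; return mean squared error on `test`."""
    a, b = fit_univariate([X[i][var] for i in train], [y[i] for i in train])
    return mean([(y[i] - (a + b * X[i][var])) ** 2 for i in test])

def double_cv(X, y, k_outer=5, k_inner=4, seed=0):
    """Hypothetical double (nested) cross-validation with variable selection."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    outer_errors, chosen = [], []
    for test in k_folds(idx, k_outer):
        test_set = set(test)
        # Objects outside the outer test fold are the only ones the
        # inner loop is allowed to see.
        rest = [i for i in idx if i not in test_set]
        inner = k_folds(rest, k_inner)

        def inner_cv_error(var):
            # Inner loop: cross-validated error of one candidate model.
            errs = []
            for val in inner:
                val_set = set(val)
                train = [i for i in rest if i not in val_set]
                errs.append(mse(X, y, train, val, var))
            return mean(errs)

        # Model selection uses inner-loop errors only.
        best = min(range(len(X[0])), key=inner_cv_error)
        chosen.append(best)
        # Model assessment: refit on all non-test objects, predict the test fold.
        outer_errors.append(mse(X, y, rest, test, best))
    return mean(outer_errors), chosen

# Toy data (assumed): 40 objects, 3 descriptors, only descriptor 1 is informative.
rng = random.Random(1)
X = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(40)]
y = [2.0 * row[1] + rng.gauss(0, 0.1) for row in X]

pe, selected = double_cv(X, y)
print("estimated prediction error:", pe)
print("descriptor selected in each outer fold:", selected)
```

The point of the design is that each outer test fold plays no part in choosing the descriptor or in fitting the coefficients, so the averaged outer-loop error is not flattered by the selection step.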

          Methods

          Simulated and real data are analysed with double cross-validation to identify important factors for the resulting model quality. For the simulated data, a bias-variance decomposition is provided.
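For orientation, the textbook bias-variance decomposition of the expected squared prediction error at a test point is given below. The notation is an assumption introduced here (true function f, fitted model \hat{f}, noise variance \sigma^2, test point x_0 with response y_0 = f(x_0) + \varepsilon), and the article's own formulation may differ in detail.

```latex
\mathrm{E}\left[\left(y_0 - \hat{f}(x_0)\right)^2\right]
  = \underbrace{\sigma^2}_{\text{irreducible error}}
  + \underbrace{\left(\mathrm{E}\left[\hat{f}(x_0)\right] - f(x_0)\right)^2}_{\text{squared bias}}
  + \underbrace{\mathrm{E}\left[\left(\hat{f}(x_0) - \mathrm{E}\left[\hat{f}(x_0)\right]\right)^2\right]}_{\text{variance}}
```

The expectations are taken over training sets and over the noise \varepsilon, with \mathrm{E}[\varepsilon] = 0 and \mathrm{Var}(\varepsilon) = \sigma^2. Simulated data make this decomposition computable because f and \sigma^2 are known.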

          Results

          The prediction errors of QSAR/QSPR regression models in combination with variable selection depend to a large degree on the parameterization of double cross-validation. While the parameters for the inner loop of double cross-validation mainly influence bias and variance of the resulting models, the parameters for the outer loop mainly influence the variability of the resulting prediction error estimate.
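Why the number of test objects affects the variability of a prediction error estimate can be illustrated with a small simulation. This is a simplified stand-in, not the study's design: the model is assumed to be fixed and exactly correct, so each residual is a pure noise draw, and the function names, test set sizes and repeat count are all hypothetical.

```python
import random
from statistics import mean, stdev

def pe_estimate(n_test, rng, noise_sd=1.0):
    """Mean squared error of a fixed, correct model on n_test fresh test objects."""
    # Assumption: the model predicts the noise-free response exactly,
    # so every residual is just one draw of the noise term.
    return mean([rng.gauss(0.0, noise_sd) ** 2 for _ in range(n_test)])

def spread_of_pe(n_test, repeats=500, seed=0):
    """Standard deviation of the error estimate over repeated test sets."""
    rng = random.Random(seed)
    return stdev([pe_estimate(n_test, rng) for _ in range(repeats)])

spread_small = spread_of_pe(5)    # few test objects per estimate
spread_large = spread_of_pe(50)   # many test objects per estimate
print("spread with  5 test objects:", spread_small)
print("spread with 50 test objects:", spread_large)
```

Under these assumptions the estimate based on the smaller test set scatters more widely around the true error, which is the kind of effect attributed to the outer-loop parameters above.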

          Conclusions

          Double cross-validation reliably and unbiasedly estimates prediction errors under model uncertainty for regression models. As compared to a single test set, double cross-validation provided a more realistic picture of model quality and should be preferred over a single test set.

          Electronic supplementary material

          The online version of this article (doi:10.1186/s13321-014-0047-1) contains supplementary material, which is available to authorized users.


                Author and article information

                Contributors
                d.baumann@tu-braunschweig.de
                k.baumann@tu-braunschweig.de
                Journal
                Journal of Cheminformatics (J Cheminform)
                Publisher: Springer International Publishing (Cham)
                ISSN: 1758-2946
                Published: 26 November 2014
                Volume 6, Issue 1, Article 47 (2014)
                Affiliations
                Institute of Medicinal and Pharmaceutical Chemistry, University of Technology Braunschweig, Beethovenstrasse 55, D-38106 Braunschweig, Germany
                Article
                Article number: 47
                DOI: 10.1186/s13321-014-0047-1
                PMCID: PMC4260165
                PMID: 25506400
                d62e7871-9b77-4675-af44-70e647d5ec92
                © Baumann and Baumann; licensee Chemistry Central Ltd. 2014

                This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

                History
                Received: 8 July 2014
                Accepted: 30 October 2014
                Categories
                Research Article

                Chemoinformatics
                Keywords: cross-validation, double cross-validation, internal validation, external validation, prediction error, regression
