      Is Open Access

      Machine learning algorithm validation with a limited sample size

      research-article


          Abstract

          Advances in neuroimaging, genomic, motion tracking, eye-tracking and many other technology-based data collection methods have led to a torrent of high-dimensional datasets, which commonly have a small number of samples because of the intrinsically high cost of data collection involving human participants. High-dimensional data with a small number of samples are of critical importance for identifying biomarkers and conducting feasibility and pilot work; however, they can lead to biased machine learning (ML) performance estimates. Our review of studies that applied ML to distinguish autistic from non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. We therefore investigated whether this bias could be caused by the use of validation methods that do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with a sample size of 1000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection, if performed on pooled training and testing data, contributes considerably more to bias than parameter tuning. In addition, the contribution to bias of data dimensionality, hyper-parameter space and number of CV folds was explored, and validation methods were compared with discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies based on what validation method was used.
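The effect of selecting features on pooled training and testing data can be illustrated with a small simulation. The sketch below is not the authors' code. It assumes pure-noise Gaussian features with balanced labels (so the true accuracy is 0.5), a nearest-centroid classifier, and a feature ranking by absolute difference of class means; the sample size, dimensionality, and all function names are hypothetical choices made for brevity.

```python
import random

def make_null_data(n, p, rng):
    """Gaussian noise features with balanced labels that carry no signal."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [i % 2 for i in range(n)]
    return X, y

def class_means(X, y, rows, feats):
    """Per-class mean of each feature in `feats`, computed on `rows` only."""
    out = {}
    for label in (0, 1):
        members = [i for i in rows if y[i] == label]
        out[label] = [sum(X[i][j] for i in members) / len(members) for j in feats]
    return out

def select_features(X, y, rows, k):
    """Keep the k features with the largest class-mean gap on `rows`."""
    all_feats = list(range(len(X[0])))
    m = class_means(X, y, rows, all_feats)
    gap = [abs(a - b) for a, b in zip(m[0], m[1])]
    return sorted(all_feats, key=lambda j: gap[j], reverse=True)[:k]

def fold_accuracy(X, y, train, test, feats):
    """Fit nearest centroid on `train`, return accuracy on `test`."""
    m = class_means(X, y, train, feats)
    hits = 0
    for i in test:
        dist = {c: sum((X[i][j] - mu) ** 2 for j, mu in zip(feats, m[c]))
                for c in (0, 1)}
        hits += min(dist, key=dist.get) == y[i]
    return hits / len(test)

def make_folds(n, n_folds, rng):
    """Shuffle indices and return a list of (train, test) index lists."""
    idx = list(range(n))
    rng.shuffle(idx)
    parts = [idx[f::n_folds] for f in range(n_folds)]
    return [([i for g, part in enumerate(parts) if g != f for i in part], parts[f])
            for f in range(n_folds)]

def cv_accuracy(X, y, folds, k, pooled):
    """K-fold accuracy; `pooled=True` selects features once on ALL samples (leaky)."""
    pooled_feats = select_features(X, y, range(len(X)), k) if pooled else None
    scores = []
    for train, test in folds:
        feats = pooled_feats if pooled else select_features(X, y, train, k)
        scores.append(fold_accuracy(X, y, train, test, feats))
    return sum(scores) / len(scores)

rng = random.Random(0)
leaky, clean = [], []
for _ in range(10):                      # repeat to average out noise
    X, y = make_null_data(n=40, p=300, rng=rng)
    folds = make_folds(len(X), 5, rng)   # same folds for both variants
    leaky.append(cv_accuracy(X, y, folds, k=10, pooled=True))
    clean.append(cv_accuracy(X, y, folds, k=10, pooled=False))

leaky_mean = sum(leaky) / len(leaky)
clean_mean = sum(clean) / len(clean)
print(f"pooled feature selection: {leaky_mean:.2f}")
print(f"fold-wise feature selection: {clean_mean:.2f}")
```

Because the labels are random, any estimate well above 0.5 reflects leakage rather than signal. In practice the same separation is usually obtained by wrapping selection and classifier in a single pipeline object (for example scikit-learn's `Pipeline` passed to `cross_val_score`), so that every fitting step sees training folds only.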

          Most cited references (19)


          LIBSVM: A library for support vector machines

          LIBSVM is a library for Support Vector Machines (SVMs). We have been actively developing this package since the year 2000. The goal is to help users to easily apply SVM to their applications. LIBSVM has gained wide popularity in machine learning and many other areas. In this article, we present all implementation details of LIBSVM. Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.

            Scikit‐learn: machine learning in python


              Bias in error estimation when using cross-validation for model selection

              Background: Cross-validation (CV) is an effective method for estimating the prediction error of a classifier. Some recent articles have proposed methods for optimizing classifiers by choosing classifier parameter values that minimize the CV error estimate. We have evaluated the validity of using the CV error estimate of the optimized classifier as an estimate of the true error expected on independent data.
              Results: We used CV to optimize the classification parameters for two kinds of classifiers: Shrunken Centroids and Support Vector Machines (SVM). Random training datasets were created, with no difference in the distribution of the features between the two classes. Using these "null" datasets, we selected classifier parameter values that minimized the CV error estimate. 10-fold CV was used for Shrunken Centroids while Leave-One-Out-CV (LOOCV) was used for the SVM. Independent test data was created to estimate the true error. With "null" and "non-null" (with differential expression between the classes) data, we also tested a nested CV procedure, where an inner CV loop is used to perform the tuning of the parameters while an outer CV is used to compute an estimate of the error. The CV error estimate for the classifier with the optimal parameters was found to be a substantially biased estimate of the true error that the classifier would incur on independent data. Even though there is no real difference between the two classes for the "null" datasets, the CV error estimate for the Shrunken Centroid with the optimal parameters was less than 30% on 18.5% of simulated training datasets. For SVM with optimal parameters the estimated error rate was less than 30% on 38% of "null" datasets. Performance of the optimized classifiers on the independent test set was no better than chance. The nested CV procedure reduces the bias considerably and gives an estimate of the error that is very close to that obtained on the independent testing set for both Shrunken Centroids and SVM classifiers for "null" and "non-null" data distributions.
              Conclusion: We show that using CV to compute an error estimate for a classifier that has itself been tuned using CV gives a significantly biased estimate of the true error. Proper use of CV for estimating true error of a classifier developed using a well-defined algorithm requires that all steps of the algorithm, including classifier parameter tuning, be repeated in each CV loop. A nested CV procedure provides an almost unbiased estimate of the true error.
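The nested procedure described in this abstract, with an inner loop for tuning and an outer loop for error estimation, can be sketched as follows. This is an illustrative simulation and not the cited study's code. It assumes pure-noise data, a nearest-centroid classifier, and a single tuned parameter `k` (the number of top-ranked features kept, ranked on training rows only); the grid, sample sizes, and function names are all hypothetical.

```python
import random

def make_null_data(n, p, rng):
    """Pure-noise features with balanced labels: true accuracy is 0.5."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [i % 2 for i in range(n)]
    return X, y

def count_hits(X, y, train, test, k):
    """Rank features and fit nearest centroid on `train`; count correct `test` predictions."""
    p = len(X[0])
    cent = {}
    for c in (0, 1):
        rows = [i for i in train if y[i] == c]
        cent[c] = [sum(X[i][j] for i in rows) / len(rows) for j in range(p)]
    feats = sorted(range(p), key=lambda j: abs(cent[0][j] - cent[1][j]),
                   reverse=True)[:k]
    hits = 0
    for i in test:
        dist = {c: sum((X[i][j] - cent[c][j]) ** 2 for j in feats) for c in (0, 1)}
        hits += min(dist, key=dist.get) == y[i]
    return hits

def make_folds(rows, n_folds, rng):
    """Shuffle `rows` and return a list of (train, test) index lists."""
    rows = list(rows)
    rng.shuffle(rows)
    parts = [rows[f::n_folds] for f in range(n_folds)]
    return [([i for g, part in enumerate(parts) if g != f for i in part], parts[f])
            for f in range(n_folds)]

def cv_accuracy(X, y, folds, k):
    """Accuracy pooled over all test folds for a fixed parameter value k."""
    total = sum(len(test) for _, test in folds)
    return sum(count_hits(X, y, tr, te, k) for tr, te in folds) / total

def tuned_cv(X, y, folds, grid):
    """Non-nested: report the best CV accuracy over the grid (optimistic)."""
    return max(cv_accuracy(X, y, folds, k) for k in grid)

def nested_cv(X, y, outer_folds, grid, rng, n_inner=4):
    """Nested: pick k by inner CV on each outer training set, score on the outer test fold."""
    hits = 0
    for train, test in outer_folds:
        inner = make_folds(train, n_inner, rng)
        best_k = max(grid, key=lambda k: cv_accuracy(X, y, inner, k))
        hits += count_hits(X, y, train, test, best_k)
    return hits / sum(len(test) for _, test in outer_folds)

rng = random.Random(1)
grid = [1, 2, 5, 10, 20, 40]
tuned, nested = [], []
for _ in range(30):                       # repeat to average out noise
    X, y = make_null_data(n=30, p=60, rng=rng)
    outer = make_folds(range(len(X)), 5, rng)
    tuned.append(tuned_cv(X, y, outer, grid))
    nested.append(nested_cv(X, y, outer, grid, rng))

tuned_mean = sum(tuned) / len(tuned)
nested_mean = sum(nested) / len(nested)
print(f"tuned (non-nested) CV accuracy: {tuned_mean:.2f}")
print(f"nested CV accuracy: {nested_mean:.2f}")
```

The non-nested figure is the maximum over the grid of scores measured on the same folds used for reporting, so it can only be optimistic. The nested figure scores each outer test fold with a parameter chosen without access to that fold.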

                Author and article information

                Contributors
                Roles: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Software, Validation, Visualization, Writing – original draft
                Roles: Conceptualization, Methodology, Supervision, Writing – review & editing
                Roles: Conceptualization, Methodology, Supervision, Writing – review & editing
                Roles: Conceptualization, Funding acquisition, Methodology, Project administration, Supervision, Validation, Writing – review & editing
                Role: Editor
                Journal
                PLoS ONE
                Public Library of Science (San Francisco, CA, USA)
                ISSN: 1932-6203
                Published: 7 November 2019
                Volume: 14
                Issue: 11
                Article: e0224365
                Affiliations
                [1] Materials, Devices and Systems Division, School of Electrical and Electronic Engineering, The University of Manchester, Manchester, England, United Kingdom
                [2] School of Biological Sciences, The University of Manchester, Manchester, England, United Kingdom
                Instituto Nacional de Medicina Genomica, MEXICO
                Author notes

                Competing Interests: The authors have declared that no competing interests exist.

                Author information
                http://orcid.org/0000-0002-0659-2890
                Article
                Manuscript ID: PONE-D-19-13163
                DOI: 10.1371/journal.pone.0224365
                PMCID: PMC6837442
                PMID: 31697686
                9cc69fdd-53ab-4ec1-b0c2-8f5129d60b22
                © 2019 Vabalas et al

                This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                History
                Received: 9 May 2019
                Accepted: 12 October 2019
                Page count
                Figures: 8, Tables: 1, Pages: 20
                Funding
                Funded by: UK Engineering and Physical Sciences Research Council and its Doctoral Training Partnership with the University of Manchester
                Award ID: EP/m507969/1
                Award Recipient :
                EG, EP and AJC hold academic positions at the University of Manchester. AV was supported by the UK Engineering and Physical Sciences Research Council (website: https://epsrc.ukri.org/) and its Doctoral Training Partnership with the University of Manchester (ref.: EP/m507969/1). The funders did not play a role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. There was no additional internal or external funding received for this study.
                Categories
                Research Article
                Physical Sciences > Mathematics > Statistics > Statistical Noise > Gaussian Noise
                Biology and Life Sciences > Psychology > Developmental Psychology > Pervasive Developmental Disorders > Autism Spectrum Disorder > Autism
                Social Sciences > Psychology > Developmental Psychology > Pervasive Developmental Disorders > Autism Spectrum Disorder > Autism
                Biology and Life Sciences > Neuroscience > Developmental Neuroscience > Neurodevelopmental Disorders > Autism
                Medicine and Health Sciences > Neurology > Neurodevelopmental Disorders > Autism
                Research and Analysis Methods > Imaging Techniques > Neuroimaging
                Biology and Life Sciences > Neuroscience > Neuroimaging
                Physical Sciences > Mathematics > Applied Mathematics > Algorithms
                Research and Analysis Methods > Simulation and Modeling > Algorithms
                Physical Sciences > Mathematics > Operator Theory > Kernel Functions
                Computer and Information Sciences > Artificial Intelligence > Machine Learning
                Physical Sciences > Mathematics > Probability Theory > Probability Distribution > Normal Distribution
                Biology and Life Sciences > Neuroscience > Cognitive Science > Cognitive Psychology > Learning > Learning Curves
                Biology and Life Sciences > Psychology > Cognitive Psychology > Learning > Learning Curves
                Social Sciences > Psychology > Cognitive Psychology > Learning > Learning Curves
                Biology and Life Sciences > Neuroscience > Learning and Memory > Learning > Learning Curves
                Custom metadata
                All relevant data are within the manuscript and its supporting Information files.
