Machine learning algorithm validation with a limited sample size

Abstract

Advances in neuroimaging, genomic, motion tracking, eye-tracking and many other technology-based data collection methods have led to a torrent of high-dimensional datasets, which commonly have a small number of samples because of the intrinsic high cost of data collection involving human participants. High-dimensional data with a small number of samples are of critical importance for identifying biomarkers and conducting feasibility and pilot work; however, they can lead to biased machine learning (ML) performance estimates. Our review of studies that applied ML to distinguish autistic from non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. We therefore investigated whether this bias could be caused by the use of validation methods that do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with a sample size of 1000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection, if performed on pooled training and testing data, contributes to bias considerably more than parameter tuning. In addition, the contribution to bias of data dimensionality, hyper-parameter space and number of CV folds was explored, and validation methods were compared with discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies based on which validation method was used.
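
The effect of pooled feature selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' simulation code: the nearest-centroid classifier, the mean-difference feature ranking, and all sizes (40 samples, 1000 features, 10 selected features, 5 folds, 20 repeats) are assumptions chosen for brevity. The data are pure noise, so any accuracy well above 0.5 reflects bias in the validation procedure rather than real signal.

```python
import numpy as np


def top_k_features(X, y, k):
    """Return indices of the k features with the largest absolute class-mean difference."""
    diff = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
    return np.argsort(diff)[-k:]


def nearest_centroid_predict(X_train, y_train, X_test):
    """Assign each test sample to the class whose training centroid is closer."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    d0 = ((X_test - c0) ** 2).sum(axis=1)
    d1 = ((X_test - c1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)


def kfold_accuracy(X, y, k_features, n_folds, rng, select_inside_fold):
    """K-fold CV accuracy with feature selection done inside each fold or on pooled data."""
    n = len(y)
    folds = np.array_split(rng.permutation(n), n_folds)
    # Pooled selection sees every sample, including those later used for testing.
    pooled = top_k_features(X, y, k_features)
    correct = 0
    for test_idx in folds:
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False
        X_tr, y_tr = X[train_mask], y[train_mask]
        if select_inside_fold:
            feats = top_k_features(X_tr, y_tr, k_features)  # training data only
        else:
            feats = pooled
        pred = nearest_centroid_predict(X_tr[:, feats], y_tr, X[test_idx][:, feats])
        correct += int((pred == y[test_idx]).sum())
    return correct / n


# Assumed simulation settings (illustrative, not taken from the paper).
rng = np.random.default_rng(0)
n_samples, n_features, k_features, n_folds, n_repeats = 40, 1000, 10, 5, 20
y = np.repeat([0, 1], n_samples // 2)  # balanced labels unrelated to the features

leaky, proper = [], []
for _ in range(n_repeats):
    X = rng.standard_normal((n_samples, n_features))  # null data: no class difference
    leaky.append(kfold_accuracy(X, y, k_features, n_folds, rng, select_inside_fold=False))
    proper.append(kfold_accuracy(X, y, k_features, n_folds, rng, select_inside_fold=True))

leaky_mean = float(np.mean(leaky))
proper_mean = float(np.mean(proper))
print(f"pooled feature selection:  mean CV accuracy = {leaky_mean:.2f}")
print(f"in-fold feature selection: mean CV accuracy = {proper_mean:.2f}")
```

Under these assumptions, the pooled variant should report accuracy well above chance on noise, while the in-fold variant should stay near 0.5. The only difference between the two is whether the held-out samples influenced which features were kept.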

Most cited references 19

LIBSVM: A library for support vector machines

Chih-Chung Chang, Chih-Jen Lin (2011)

LIBSVM is a library for Support Vector Machines (SVMs). We have been actively developing this package since the year 2000. The goal is to help users to easily apply SVM to their applications. LIBSVM has gained wide popularity in machine learning and many other areas. In this article, we present all implementation details of LIBSVM. Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.

Scikit-learn: Machine Learning in Python

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort … (2011)

Bias in error estimation when using cross-validation for model selection

Sudhir Varma, Richard M Simon (2006)

Background: Cross-validation (CV) is an effective method for estimating the prediction error of a classifier. Some recent articles have proposed methods for optimizing classifiers by choosing classifier parameter values that minimize the CV error estimate. We have evaluated the validity of using the CV error estimate of the optimized classifier as an estimate of the true error expected on independent data.

Results: We used CV to optimize the classification parameters for two kinds of classifiers; Shrunken Centroids and Support Vector Machines (SVM). Random training datasets were created, with no difference in the distribution of the features between the two classes. Using these "null" datasets, we selected classifier parameter values that minimized the CV error estimate. 10-fold CV was used for Shrunken Centroids while Leave-One-Out-CV (LOOCV) was used for the SVM. Independent test data was created to estimate the true error. With "null" and "non null" (with differential expression between the classes) data, we also tested a nested CV procedure, where an inner CV loop is used to perform the tuning of the parameters while an outer CV is used to compute an estimate of the error. The CV error estimate for the classifier with the optimal parameters was found to be a substantially biased estimate of the true error that the classifier would incur on independent data. Even though there is no real difference between the two classes for the "null" datasets, the CV error estimate for the Shrunken Centroid with the optimal parameters was less than 30% on 18.5% of simulated training data-sets. For SVM with optimal parameters the estimated error rate was less than 30% on 38% of "null" data-sets. Performance of the optimized classifiers on the independent test set was no better than chance. The nested CV procedure reduces the bias considerably and gives an estimate of the error that is very close to that obtained on the independent testing set for both Shrunken Centroids and SVM classifiers for "null" and "non-null" data distributions.

Conclusion: We show that using CV to compute an error estimate for a classifier that has itself been tuned using CV gives a significantly biased estimate of the true error. Proper use of CV for estimating true error of a classifier developed using a well defined algorithm requires that all steps of the algorithm, including classifier parameter tuning, be repeated in each CV loop. A nested CV procedure provides an almost unbiased estimate of the true error.
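
The nested procedure summarized above (an inner CV loop for parameter tuning, an outer CV loop for error estimation) can be sketched with scikit-learn. This is an illustrative sketch, not the code used by Varma and Simon: the RBF-kernel SVM, the parameter grid, the 5-fold splits (the cited study used LOOCV for the SVM) and the dataset size are all assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Assumed "null" dataset: features are pure noise, labels are balanced.
rng = np.random.default_rng(0)
n_samples, n_features = 40, 50
X = rng.standard_normal((n_samples, n_features))
y = np.repeat([0, 1], n_samples // 2)

# Hypothetical hyper-parameter grid, chosen only for illustration.
param_grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.001, 0.01, 0.1]}
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)

# Non-nested: the same CV folds both choose the parameters and score the
# winner, so the reported score is the maximum over the grid.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=outer_cv)
search.fit(X, y)
non_nested_score = search.best_score_

# Nested: tuning is repeated inside every outer training fold, and the
# outer test fold is never seen during tuning.
tuned_svm = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=inner_cv)
nested_scores = cross_val_score(tuned_svm, X, y, cv=outer_cv)
nested_score = nested_scores.mean()

print(f"non-nested CV accuracy: {non_nested_score:.2f}")
print(f"nested CV accuracy:     {nested_score:.2f}")
```

On null data such as this, the non-nested score tends to be optimistic because it is the best of several noisy estimates, whereas the nested score estimates the performance of the whole tuning procedure. A single small dataset gives a noisy comparison, so repeating the experiment over many simulated datasets would be needed to see the average gap.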

Author and article information

Contributors

Andrius Vabalas:

ORCID: http://orcid.org/0000-0002-0659-2890

Roles: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Software, Validation, Visualization, Writing – original draft

Emma Gowen: Conceptualization, Methodology, Supervision, Writing – review & editing

Ellen Poliakoff: Conceptualization, Methodology, Supervision, Writing – review & editing

Alexander J. Casson: Conceptualization, Funding acquisition, Methodology, Project administration, Supervision, Validation, Writing – review & editing

Enrique Hernandez-Lemus: Editor

Journal

Journal ID (nlm-ta): PLoS One

Journal ID (iso-abbrev): PLoS ONE

Journal ID (publisher-id): plos

Journal ID (pmc): plosone

Title: PLoS ONE

Publisher: Public Library of Science (San Francisco, CA, USA)

ISSN (Electronic): 1932-6203

Publication date (Collection): 2019

Publication date (Electronic): 7 November 2019

Volume: 14

Issue: 11

Electronic Location Identifier: e0224365

Affiliations

[1] Materials, Devices and Systems Division, School of Electrical and Electronic Engineering, The University of Manchester, Manchester, England, United Kingdom

[2] School of Biological Sciences, The University of Manchester, Manchester, England, United Kingdom

Instituto Nacional de Medicina Genomica, MEXICO

Author notes

Competing Interests: The authors have declared that no competing interests exist.

* E-mail: andrius.vabalas@manchester.ac.uk

Author information

Andrius Vabalas http://orcid.org/0000-0002-0659-2890

Article

Publisher ID: PONE-D-19-13163

DOI: 10.1371/journal.pone.0224365

PMC ID: 6837442

PubMed ID: 31697686

SO-VID: 9cc69fdd-53ab-4ec1-b0c2-8f5129d60b22

License:

This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

History

Date received: 9 May 2019

Date accepted: 12 October 2019

Page count

Figures: 8, Tables: 1, Pages: 20

Funding

Funded by: UK Engineering and Physical Sciences Research Council and its Doctoral Training Partnership with the University of Manchester

Award ID: EP/m507969/1

Award Recipient: Andrius Vabalas (ORCID: http://orcid.org/0000-0002-0659-2890)

EG, EP and AJC hold academic positions at the University of Manchester. AV was supported by the UK Engineering and Physical Sciences Research Council (website: https://epsrc.ukri.org/) and its Doctoral Training Partnership with the University of Manchester (ref.: EP/m507969/1). The funders did not play a role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. There was no additional internal or external funding received for this study.

Custom metadata

Data Availability: All relevant data are within the manuscript and its Supporting Information files.
