
      On Splitting Training and Validation Set: A Comparative Study of Cross-Validation, Bootstrap and Systematic Sampling for Estimating the Generalization Performance of Supervised Learning

      research-article


          Abstract

          Model validation is the most important part of building a supervised model. To build a model with good generalization performance, one must have a sensible data-splitting strategy, and this is crucial for model validation. In this study, we conducted a comparative study of various reported data-splitting methods. The MixSim model was employed to generate nine simulated datasets with different probabilities of misclassification and variable sample sizes. Partial least squares for discriminant analysis and support vector machines for classification were then applied to these datasets. The data-splitting methods tested included variants of cross-validation, bootstrapping, bootstrapped Latin partition, the Kennard-Stone algorithm (K-S) and the sample set partitioning based on joint XY distances algorithm (SPXY). These methods were employed to split the data into training and validation sets. The generalization performances estimated from the validation sets were then compared with those obtained from blind test sets, which were generated from the same distribution but were unseen by the training/validation procedure used in model construction. The results showed that the size of the data is the deciding factor for the quality of the generalization performance estimated from the validation set. We found that there was a significant gap between the performance estimated from the validation set and that from the test set for all the data-splitting methods employed on small datasets. This disparity decreased when more samples were available for training/validation, because the models then moved towards the approximations implied by the central limit theorem for the simulated datasets used. We also found that having too many or too few samples in the training set had a negative effect on the estimated model performance, suggesting that a good balance between the sizes of the training and validation sets is necessary for a reliable estimate of model performance. Finally, systematic sampling methods such as K-S and SPXY generally gave very poor estimates of model performance, most likely because they are designed to take the most representative samples first and thus leave a rather unrepresentative sample set for performance estimation.
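
          To make the comparison described above concrete, the sketch below estimates classification accuracy from an internal validation strategy and from a blind test set held out of model building, using one random splitting strategy (5-fold cross-validation) and one systematic strategy (a Kennard-Stone split). It is a minimal illustration, not the authors' code: it assumes scikit-learn and SciPy are available, uses a synthetic dataset from make_classification in place of the MixSim simulations, and an RBF support vector machine in place of the full PLS-DA/SVM comparison in the paper.

          import numpy as np
          from scipy.spatial.distance import cdist
          from sklearn.datasets import make_classification
          from sklearn.model_selection import train_test_split, cross_val_score
          from sklearn.svm import SVC


          def kennard_stone(X, n_select):
              """Greedy Kennard-Stone selection: return indices of the n_select most
              space-covering samples, chosen by maximising the minimum distance to
              the points already selected."""
              d = cdist(X, X)
              i, j = np.unravel_index(np.argmax(d), d.shape)  # two most distant points
              selected = [int(i), int(j)]
              remaining = [k for k in range(len(X)) if k not in selected]
              while len(selected) < n_select and remaining:
                  min_d = d[np.ix_(remaining, selected)].min(axis=1)
                  nxt = remaining[int(np.argmax(min_d))]
                  selected.append(nxt)
                  remaining.remove(nxt)
              return np.array(selected)


          # Simulated two-class data standing in for the MixSim-style datasets.
          X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                                     random_state=0)

          # Blind test set: never touched during model building.
          X_dev, X_test, y_dev, y_test = train_test_split(
              X, y, test_size=0.3, stratify=y, random_state=0)

          model = SVC(kernel="rbf", C=1.0, gamma="scale")

          # 1) Random splitting: 5-fold cross-validation on the development pool.
          cv_acc = cross_val_score(model, X_dev, y_dev, cv=5).mean()

          # 2) Systematic splitting: Kennard-Stone puts the most representative
          #    samples into the training set; the leftovers form the validation set.
          train_idx = kennard_stone(X_dev, n_select=int(0.7 * len(X_dev)))
          val_idx = np.setdiff1d(np.arange(len(X_dev)), train_idx)
          ks_val_acc = model.fit(X_dev[train_idx], y_dev[train_idx]).score(
              X_dev[val_idx], y_dev[val_idx])

          # External reference: fit on the whole development pool, score on the
          # blind test set.
          test_acc = model.fit(X_dev, y_dev).score(X_test, y_test)

          print(f"5-fold CV accuracy (validation estimate): {cv_acc:.3f}")
          print(f"Kennard-Stone validation accuracy:        {ks_val_acc:.3f}")
          print(f"blind test accuracy (reference):          {test_acc:.3f}")

          Because the Kennard-Stone step assigns the most space-covering samples to the training set, the leftover validation samples tend to be less representative of the overall distribution, which is consistent with the abstract's observation that systematic splitting methods give poor estimates of model performance.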

          Electronic supplementary material

          The online version of this article (10.1007/s41664-018-0068-2) contains supplementary material, which is available to authorized users.


                Author and article information

                Contributors
                +44-161-306-5145, yun.xu-2@manchester.ac.uk
                Journal
                J Anal Test (Journal of Analysis and Testing)
                Springer Singapore (Singapore)
                ISSN: 2096-241X; 2509-4696
                Published: 29 October 2018
                Volume 2, Issue 3, Pages 249-262
                Affiliations
                [1] School of Chemistry, Manchester Institute of Biotechnology, The University of Manchester, Manchester, M1 7DN, UK (ISNI 0000000121662407; GRID grid.5379.8)
                [2] Department of Biochemistry, Institute of Integrative Biology, University of Liverpool, Biosciences Building, Crown Street, Liverpool, L69 7ZB, UK (ISNI 0000 0004 1936 8470; GRID grid.10025.36)
                Author information
                http://orcid.org/0000-0003-3228-5111
                http://orcid.org/0000-0003-2230-645X
                Article
                Publisher ID: 68
                DOI: 10.1007/s41664-018-0068-2
                PMCID: 6373628
                PMID: 30842888
                Record ID: 08e04999-748c-4a31-aabf-6b3d2b55c9b5
                © The Author(s) 2018

                Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

                History
                4 June 2018
                8 October 2018
                12 October 2018
                Funding
                Funded by: Wellcome Trust (FundRef: http://dx.doi.org/10.13039/100004440)
                Award ID: 202952/Z/16/Z
                Categories
                Original Paper
                Custom metadata
                © The Nonferrous Metals Society of China 2018

                Keywords: cross-validation, bootstrapping, bootstrapped Latin partition, Kennard-Stone algorithm, SPXY, model selection, model validation, partial least squares for discriminant analysis, support vector machines
