15
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      A comparative evaluation of the generalised predictive ability of eight machine learning algorithms across ten clinical metabolomics data sets for binary classification

      research-article

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          Introduction

          Metabolomics is increasingly being used in the clinical setting for disease diagnosis, prognosis and risk prediction. Machine learning algorithms are particularly important in the construction of multivariate metabolite prediction. Historically, partial least squares (PLS) regression has been the gold standard for binary classification. Nonlinear machine learning methods such as random forests (RF), kernel support vector machines (SVM) and artificial neural networks (ANN) may be more suited to modelling possible nonlinear metabolite covariance, and thus provide better predictive models.

          Objectives

          We hypothesise that for binary classification using metabolomics data, non-linear machine learning methods will provide superior generalised predictive ability when compared to linear alternatives, in particular when compared with the current gold standard PLS discriminant analysis.

          Methods

          We compared the general predictive performance of eight archetypal machine learning algorithms across ten publicly available clinical metabolomics data sets. The algorithms were implemented in the Python programming language. All code and results have been made publicly available as Jupyter notebooks.

          Results

          There was only marginal improvement in predictive ability for SVM and ANN over PLS across all data sets. RF performance was comparatively poor. The use of out-of-bag bootstrap confidence intervals provided a measure of uncertainty of model prediction such that the quality of metabolomics data was observed to be a bigger influence on generalised performance than model choice.

          Conclusion

          The size of the data set, and choice of performance metric, had a greater influence on generalised predictive performance than the choice of machine learning algorithm.

          Electronic supplementary material

          The online version of this article (10.1007/s11306-019-1612-4) contains supplementary material, which is available to authorized users.

          Related collections

          Most cited references23

          • Record: found
          • Abstract: found
          • Article: not found

          MassBank: a public repository for sharing mass spectral data for life sciences.

          MassBank is the first public repository of mass spectra of small chemical compounds for life sciences (<3000 Da). The database contains 605 electron-ionization mass spectrometry (EI-MS), 137 fast atom bombardment MS and 9276 electrospray ionization (ESI)-MS(n) data of 2337 authentic compounds of metabolites, 11 545 EI-MS and 834 other-MS data of 10,286 volatile natural and synthetic compounds, and 3045 ESI-MS(2) data of 679 synthetic drugs contributed by 16 research groups (January 2010). ESI-MS(2) data were analyzed under nonstandardized, independent experimental conditions. MassBank is a distributed database. Each research group provides data from its own MassBank data servers distributed on the Internet. MassBank users can access either all of the MassBank data or a subset of the data by specifying one or more experimental conditions. In a spectral search to retrieve mass spectra similar to a query mass spectrum, the similarity score is calculated by a weighted cosine correlation in which weighting exponents on peak intensity and the mass-to-charge ratio are optimized to the ESI-MS(2) data. MassBank also provides a merged spectrum for each compound prepared by merging the analyzed ESI-MS(2) data on an identical compound under different collision-induced dissociation conditions. Data merging has significantly improved the precision of the identification of a chemical compound by 21-23% at a similarity score of 0.6. Thus, MassBank is useful for the identification of chemical compounds and the publication of experimental data. 2010 John Wiley & Sons, Ltd.
            Bookmark
            • Record: found
            • Abstract: found
            • Article: not found

            METLIN: a metabolite mass spectral database.

            Endogenous metabolites have gained increasing interest over the past 5 years largely for their implications in diagnostic and pharmaceutical biomarker discovery. METLIN (http://metlin.scripps.edu), a freely accessible web-based data repository, has been developed to assist in a broad array of metabolite research and to facilitate metabolite identification through mass analysis. METLINincludes an annotated list of known metabolite structural information that is easily cross-correlated with its catalogue of high-resolution Fourier transform mass spectrometry (FTMS) spectra, tandem mass spectrometry (MS/MS) spectra, and LC/MS data.
              Bookmark
              • Record: found
              • Abstract: not found
              • Article: not found

              Universal Approximation Using Radial-Basis-Function Networks

                Bookmark

                Author and article information

                Contributors
                +61 (0)8-6304-2705 , d.broadhurst@ecu.edu.au
                Journal
                Metabolomics
                Metabolomics
                Metabolomics
                Springer US (New York )
                1573-3882
                1573-3890
                15 November 2019
                15 November 2019
                2019
                : 15
                : 12
                : 150
                Affiliations
                ISNI 0000 0004 0389 4302, GRID grid.1038.a, Centre for Metabolomics & Computational Biology, School of Science, Edith Cowan University, ; Joondalup, 6027 Australia
                Author information
                http://orcid.org/0000-0002-8832-2607
                http://orcid.org/0000-0002-0758-0330
                http://orcid.org/0000-0003-0775-9581
                Article
                1612
                10.1007/s11306-019-1612-4
                6856029
                31728648
                fc939a65-b9b0-4830-8918-1634ad04eb76
                © The Author(s) 2019

                Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

                History
                : 22 September 2019
                : 5 November 2019
                Funding
                Funded by: Australian Research Council
                Award ID: LE170100021
                Award Recipient :
                Categories
                Original Article
                Custom metadata
                © Springer Science+Business Media, LLC, part of Springer Nature 2019

                Molecular biology
                metabolomics,partial least squares,support vector machines,random forest,artificial neural network,machine learning,jupyter,open source

                Comments

                Comment on this article