
      Influence of Missing Values Substitutes on Multivariate Analysis of Metabolomics Data




          Missing values are known to be problematic for the analysis of gas chromatography-mass spectrometry (GC-MS) metabolomics data. Typically these values account for about 10%–20% of all data and can arise from analytical, computational, or biological sources. Currently, the best-known substitute for missing values is mean imputation. In fact, some researchers consider this step of their metabolomics pipeline so routine that they do not even mention using it. However, the choice of substitute may have a significant influence on the output of the data analysis and may be highly sensitive to the distribution of samples between classes. Therefore, in this study we analysed five substitutes for missing values, namely zero, mean, median, k-nearest neighbours (kNN) and random forest (RF) imputation, in terms of their influence on unsupervised and supervised learning and, thus, their impact on the biological interpretation of the final output. The comparisons are demonstrated both visually and computationally (classification rate). The results show that the choice of replacement method may have a considerable effect on classification accuracy; if performed incorrectly, imputation may negatively influence the biomarkers selected for early disease diagnosis or the identification of cancer-related metabolites. For the GC-MS metabolomics data studied here, our findings recommend that RF be favored over the other tested methods for imputing missing values. This approach gave excellent classification rates for both supervised methods, principal components-linear discriminant analysis (PC-LDA) (98.02%) and partial least squares-discriminant analysis (PLS-DA) (97.96%), outperforming the other imputation methods.
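
          The comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state which software was used, so the use of scikit-learn is an assumption, as are the synthetic data, the 15% missing rate, the number of principal components, and every parameter value. The RF imputer is approximated here with `IterativeImputer` wrapped around a `RandomForestRegressor` (a missForest-style stand-in), and only PC-LDA is shown; PLS-DA is omitted for brevity.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification


def make_imputers(random_state=0):
    """Return the five substitutes named in the abstract (parameters are assumptions)."""
    rf = RandomForestRegressor(n_estimators=20, random_state=random_state)
    return {
        "zero": SimpleImputer(strategy="constant", fill_value=0.0),
        "mean": SimpleImputer(strategy="mean"),
        "median": SimpleImputer(strategy="median"),
        "kNN": KNNImputer(n_neighbors=5),
        # missForest-style approximation: each feature is regressed on the others
        "RF": IterativeImputer(estimator=rf, max_iter=3, random_state=random_state),
    }


def compare_imputers(X, y, n_components=5, cv=5):
    """Cross-validated PC-LDA accuracy for each imputation method."""
    scores = {}
    for name, imputer in make_imputers().items():
        # The imputer sits inside the pipeline, so it is fitted on training folds only
        model = make_pipeline(
            imputer,
            PCA(n_components=n_components),
            LinearDiscriminantAnalysis(),
        )
        scores[name] = cross_val_score(model, X, y, cv=cv).mean()
    return scores


# Synthetic stand-in for a metabolite table (samples x features); not GC-MS data
X, y = make_classification(
    n_samples=120, n_features=12, n_informative=6, random_state=0
)

# Knock out roughly 15% of entries at random, within the 10%-20% range quoted above
rng = np.random.default_rng(0)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.15] = np.nan

scores = compare_imputers(X_missing, y)
for name, acc in scores.items():
    print(f"{name:>6}: {acc:.3f}")
```

          Placing the imputer inside the pipeline is a design choice of this sketch: it keeps held-out samples from influencing the imputed values, which matters when classification rate is the comparison metric. The accuracies printed for synthetic data say nothing about which method is best for real GC-MS data.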



                Author and article information

                Published: 16 June 2014
                Volume: 4
                Issue: 2
                Pages: 433-452
                [1] School of Chemistry, Manchester Institute of Biotechnology, The University of Manchester, 131 Princess Street, Manchester M1 7DN, UK; E-Mails: piotr.gromski@postgrad.manchester.ac.uk (P.S.G.); Yun.Xu-2@manchester.ac.uk (Y.X.); helen.kotze@postgrad.manchester.ac.uk (H.L.K.); E.S.Correa@manchester.ac.uk (E.C.); D.Ellis@manchester.ac.uk (D.I.E.); emily.armitage@ceu.es (E.G.A.)
                [2] School of Chemistry, Brunswick Street, The University of Manchester, Manchester M13 9PL, UK. E-Mail: Michael.Turner@manchester.ac.uk (M.L.T.)
                Author notes
                [*] Author to whom correspondence should be addressed; E-Mail: roy.goodacre@manchester.ac.uk; Tel.: +44-(0)-161-306-4480.
                © 2014 by the authors; licensee MDPI, Basel, Switzerland.

                This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

                Received: 14 April 2014
                Revised: 29 May 2014
                Accepted: 05 June 2014

                Keywords: missing values, metabolomics, unsupervised learning, supervised learning

