Influence of Missing Values Substitutes on Multivariate Analysis of Metabolomics Data

Abstract

Missing values are known to be problematic for the analysis of gas chromatography-mass spectrometry (GC-MS) metabolomics data. Typically these values account for about 10%–20% of all data and can arise from analytical, computational, or biological causes. Currently, the best-known substitute for missing values is mean imputation. In fact, some researchers consider this step of their metabolomics pipeline so routine that they do not even mention using it. However, the choice of substitute may have a significant influence on the output of the data analysis and might be highly sensitive to the distribution of samples between classes. Therefore, in this study we analysed five substitutes for missing values, namely zero, mean, median, k-nearest neighbours (kNN) and random forest (RF) imputation, in terms of their influence on unsupervised and supervised learning and, thus, their impact on the final biological interpretation. These comparisons are demonstrated both visually and computationally (classification rate). The results show that the choice of replacement method may have a considerable effect on classification accuracy; if performed incorrectly, imputation may negatively influence the biomarkers selected for early disease diagnosis or the identification of cancer-related metabolites. For the GC-MS metabolomics data studied here, our findings recommend that RF be favoured over the other tested methods for imputing missing values. This approach gave excellent classification rates for both supervised methods, principal components-linear discriminant analysis (PC-LDA) (98.02%) and partial least squares-discriminant analysis (PLS-DA) (97.96%), outperforming the other imputation methods.
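To make the compared substitutes concrete, here is a minimal pure-Python sketch of four of them (zero, mean, median and kNN) on a toy samples-by-metabolites matrix. It is an illustration only, not the authors' implementation: the function names, the toy data, the use of `None` for a missing value, and the distance rule (root mean squared difference over features observed in both samples) are all assumptions. RF imputation is not shown, and the sketch assumes every metabolite has at least one observed value.

```python
import math
from statistics import mean, median


def impute_zero(X):
    """Replace every missing value (None) with 0.0."""
    return [[0.0 if v is None else v for v in row] for row in X]


def impute_column_stat(X, stat):
    """Replace None in each column with stat() of that column's observed values."""
    fills = [stat([v for v in col if v is not None]) for col in zip(*X)]
    return [[fills[j] if v is None else v for j, v in enumerate(row)] for row in X]


def _distance(a, b):
    """Root mean squared difference over features observed in both rows."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf  # rows cannot be compared
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def impute_knn(X, k=2):
    """Fill each missing cell with the mean of that feature over the k
    nearest rows in which the feature was observed."""
    out = [list(row) for row in X]
    for i, row in enumerate(X):
        for j, v in enumerate(row):
            if v is not None:
                continue
            donors = sorted(
                (_distance(row, other), other[j])
                for m, other in enumerate(X)
                if m != i and other[j] is not None
            )
            nearest = [val for d, val in donors[:k] if d < math.inf]
            if nearest:
                out[i][j] = mean(nearest)
            else:
                # No comparable neighbour: fall back to the column mean.
                out[i][j] = mean(o[j] for o in X if o[j] is not None)
    return out


# Hypothetical toy data: rows are samples, columns are metabolites.
# The first two rows mimic one class, the last two another.
X = [
    [1.0, 2.0, None],
    [1.1, 2.1, 3.0],
    [5.0, 6.0, 9.0],
    [None, 6.2, 9.4],
]

if __name__ == "__main__":
    print("zero  :", impute_zero(X)[0][2])
    print("mean  :", impute_column_stat(X, mean)[0][2])
    print("median:", impute_column_stat(X, median)[0][2])
    print("kNN   :", impute_knn(X, k=1)[0][2])
```

In this toy matrix the missing third metabolite of the first sample is filled with about 7.13 by the mean and 9.0 by the median, both dominated by the other class, whereas kNN with k=1 borrows 3.0 from the most similar sample. This is the kind of sensitivity to class distribution that the abstract describes.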

Most cited references 50

Multivariate Data Analysis

Joseph F. Hair (2010)

For over 30 years, this text has provided students with the information they need to understand and apply multivariate data analysis. This text provides an applications-oriented introduction to multivariate analysis for the non-statistician. By reducing heavy statistical research into fundamental concepts, the text explains to students how to understand and make use of the results of specific statistical techniques. In this revision, the organization of the chapters has been greatly simplified. New chapters have been added on structural equations modeling, and all sections have been updated to reflect advances in technology, capability, and mathematical techniques. Pearson New International Edition.

MetaboAnalyst: a web server for metabolomic data analysis and interpretation

Jianguo Xia, Nick Psychogios, Nelson Young … (2009)

Metabolomics is a newly emerging field of ‘omics’ research that is concerned with characterizing large numbers of metabolites using NMR, chromatography and mass spectrometry. It is frequently used in biomarker identification and the metabolic profiling of cells, tissues or organisms. The data processing challenges in metabolomics are quite unique and often require specialized (or expensive) data analysis software and a detailed knowledge of cheminformatics, bioinformatics and statistics. In an effort to simplify metabolomic data analysis while at the same time improving user accessibility, we have developed a freely accessible, easy-to-use web server for metabolomic data analysis called MetaboAnalyst. Fundamentally, MetaboAnalyst is a web-based metabolomic data processing tool not unlike many of today's web-based microarray analysis packages. It accepts a variety of input data (NMR peak lists, binned spectra, MS peak lists, compound/concentration data) in a wide variety of formats. It also offers a number of options for metabolomic data processing, data normalization, multivariate statistical analysis, graphing, metabolite identification and pathway mapping. In particular, MetaboAnalyst supports such techniques as: fold change analysis, t-tests, PCA, PLS-DA, hierarchical clustering and a number of more sophisticated statistical or machine learning methods. It also employs a large library of reference spectra to facilitate compound identification from most kinds of input spectra. MetaboAnalyst guides users through a step-by-step analysis pipeline using a variety of menus, information hyperlinks and check boxes. Upon completion, the server generates a detailed report describing each method used, embedded with graphical and tabular outputs. MetaboAnalyst is capable of handling most kinds of metabolomic data and was designed to perform most of the common kinds of metabolomic data analyses. MetaboAnalyst is accessible at http://www.metaboanalyst.ca

Statistical pattern recognition: a review

A.K. Jain, Robert P.W. Duin, Jianchang Mao (2000)

Author and article information

Journal

Journal ID (nlm-ta): Metabolites

Journal ID (iso-abbrev): Metabolites

Journal ID (publisher-id): metabolites

Title: Metabolites

Publisher: MDPI

ISSN (Electronic): 2218-1989

Publication date (Electronic): 16 June 2014

Publication date Collection: June 2014

Volume: 4

Issue: 2

Pages: 433-452

Affiliations

[1] School of Chemistry, Manchester Institute of Biotechnology, The University of Manchester, 131 Princess Street, Manchester M1 7DN, UK; E-Mails: piotr.gromski@postgrad.manchester.ac.uk (P.S.G.); Yun.Xu-2@manchester.ac.uk (Y.X.); helen.kotze@postgrad.manchester.ac.uk (H.L.K.); E.S.Correa@manchester.ac.uk (E.C.); D.Ellis@manchester.ac.uk (D.I.E.); emily.armitage@ceu.es (E.G.A.)

[2] School of Chemistry, Brunswick Street, The University of Manchester, Manchester M13 9PL, UK. E-Mail: Michael.Turner@manchester.ac.uk (M.L.T.)

Author notes

[*] Author to whom correspondence should be addressed; E-Mail: roy.goodacre@manchester.ac.uk; Tel.: +44-(0)-161-306-4480.

Article

Publisher ID: metabolites-04-00433

DOI: 10.3390/metabo4020433

PMC ID: 4101515

PubMed ID: 24957035

SO-VID: f590a451-5eaa-4215-8cc2-2d11720708d1

License:

This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

History

Date received: 14 April 2014

Date revision received: 29 May 2014

Date accepted: 05 June 2014
