Mining the Unknown: A Systems Approach to Metabolite Identification Combining Genetic and Metabolic Information


Abstract

Recent genome-wide association studies (GWAS) with metabolomics data linked genetic variation in the human genome to differences in individual metabolite levels. A strong relevance of this metabolic individuality for biomedical and pharmaceutical research has been reported. However, a considerable number of the molecules currently quantified by modern metabolomics techniques are chemically unidentified. The identification of these “unknown metabolites” is still a demanding and intricate task, limiting their usability as functional markers of metabolic processes. As a consequence, previous GWAS largely ignored unknown metabolites as metabolic traits for the analysis. Here we present a systems-level approach that combines genome-wide association analysis and Gaussian graphical modeling with metabolomics to predict the identity of the unknown metabolites. We apply our method to original data of 517 metabolic traits, of which 225 are unknowns, and genotyping information on 655,658 genetic variants, measured in 1,768 human blood samples. We report previously undescribed genotype–metabotype associations for six distinct gene loci (SLC22A2, COMT, CYP3A5, CYP2C18, GBA3, UGT3A1) and one locus not related to any known gene (rs12413935). Overlaying the inferred genetic associations, metabolic networks, and knowledge-based pathway information, we derive testable hypotheses on the biochemical identities of 106 unknown metabolites. As a proof of principle, we experimentally confirm nine concrete predictions. We demonstrate the benefit of our method for the functional interpretation of previous metabolomics biomarker studies on liver detoxification, hypertension, and insulin resistance. Our approach is generic in nature and can be directly transferred to metabolomics data from different experimental platforms.
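To make the Gaussian graphical modeling idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the standard textbook definition of a Gaussian graphical model, in which edges are given by partial correlations obtained from the inverse covariance matrix. The function name partial_correlations and the three-variable toy data are hypothetical and chosen only for illustration.

```python
import numpy as np


def partial_correlations(data):
    """Return the matrix of pairwise partial correlations.

    data: array of shape (n_samples, n_variables). The sketch assumes more
    samples than variables, so that the covariance matrix is invertible.
    """
    cov = np.cov(data, rowvar=False)
    precision = np.linalg.inv(cov)
    d = np.sqrt(np.diag(precision))
    # Partial correlation of i and j given all others: -P_ij / sqrt(P_ii * P_jj)
    pcor = -precision / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor


# Hypothetical toy pathway A -> B -> C: A and C are related only through B.
rng = np.random.default_rng(0)
n = 5000
a = rng.normal(size=n)
b = a + 0.5 * rng.normal(size=n)
c = b + 0.5 * rng.normal(size=n)
toy = np.column_stack([a, b, c])

pearson = np.corrcoef(toy, rowvar=False)
pcor = partial_correlations(toy)

print("Pearson correlation A-C:", round(float(pearson[0, 2]), 3))
print("Partial correlation A-C:", round(float(pcor[0, 2]), 3))
```

In this toy example the ordinary correlation between A and C is high, while their partial correlation is close to zero because B explains the relationship. This is the property that lets a partial-correlation network point to direct neighbors of an unknown metabolite rather than to everything that co-varies with it.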

Author Summary

Genome-wide association studies on metabolomics data have demonstrated that genetic variation in metabolic enzymes and transporters leads to concentration changes in the respective metabolite levels. The conventional goal of these studies is the detection of novel interactions between the genome and the metabolic system, providing valuable insights for both basic research and clinical applications. In this study, we borrow the metabolomics GWAS concept for a novel, entirely different purpose. Metabolite measurements frequently produce signals where a certain substance can be reliably detected in the sample, but it has not yet been elucidated which specific metabolite this signal actually represents. The concept is comparable to a fingerprint: each one is uniquely identifiable, but as long as it is not registered in a database one cannot tell to whom this fingerprint belongs. Obviously, this issue tremendously reduces the usability of metabolomics analyses. The genetic associations of such an “unknown,” however, give us concrete evidence of the metabolic pathway this substance is most probably involved in. Moreover, we complement the approach with a specific measure of correlation between metabolites, providing further evidence of the metabolic processes of the unknown. For a number of cases, this even allows for a concrete identity prediction, which we then experimentally validate in the lab.
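The link between a genetic variant and a metabolite level can be illustrated with a single-variant test. The sketch below is an assumption-laden illustration, not the study's pipeline: it assumes an additive model in which the metabolite level is regressed on the allele count (0, 1, or 2), and it uses simulated data. The function name additive_association, the allele frequency, and the effect size are all hypothetical. A real analysis would typically also adjust for covariates and correct for the number of variants and traits tested, which is not shown here.

```python
import numpy as np


def additive_association(genotype, metabolite):
    """Regress a metabolite level on allele count (0, 1, 2).

    Returns (slope, t_statistic) from ordinary least squares with an intercept.
    """
    x = np.asarray(genotype, dtype=float)
    y = np.asarray(metabolite, dtype=float)
    n = x.size
    x_centered = x - x.mean()
    sxx = np.sum(x_centered ** 2)
    slope = np.sum(x_centered * (y - y.mean())) / sxx
    intercept = y.mean() - slope * x.mean()
    residuals = y - (intercept + slope * x)
    # Residual variance with n - 2 degrees of freedom (slope and intercept).
    sigma2 = np.sum(residuals ** 2) / (n - 2)
    t_statistic = slope / np.sqrt(sigma2 / sxx)
    return slope, t_statistic


# Simulated data (hypothetical): each copy of the allele raises the level by 0.4.
rng = np.random.default_rng(1)
n = 2000
genotype = rng.binomial(2, 0.3, size=n)
metabolite = 0.4 * genotype + rng.normal(size=n)

slope, t_statistic = additive_association(genotype, metabolite)
print("estimated effect per allele:", round(float(slope), 3))
print("t statistic:", round(float(t_statistic), 2))
```

If an unknown signal shows a strong association of this kind with a variant in, say, a transporter or enzyme gene, the known function of that gene narrows down which pathway the unknown is likely to belong to.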

Author and article information

Contributors

Mark I. McCarthy: Role: Editor

Journal

Journal ID (nlm-ta): PLoS Genet

Journal ID (iso-abbrev): PLoS Genet

Journal ID (publisher-id): plos

Journal ID (pmc): plosgen

Title: PLoS Genetics

Publisher: Public Library of Science (San Francisco, USA)

ISSN (Print): 1553-7390

ISSN (Electronic): 1553-7404

Publication date (Collection): October 2012

Publication date (Print): October 2012

Publication date (Electronic): 18 October 2012

Volume: 8

Issue: 10

Electronic Location Identifier: e1003005

Affiliations

[1] Institute of Bioinformatics and Systems Biology, Helmholtz Zentrum München, Neuherberg, Germany

[2] Department of Physiology and Biophysics, Weill Cornell Medical College in Qatar, Education City, Qatar Foundation, Doha, Qatar

[3] Metabolon, Research Triangle Park, North Carolina, United States of America

[4] Department of Genome-Oriented Bioinformatics, Life and Food Science Center Weihenstephan, Technische Universität München, Freising, Germany

[5] Research Unit of Molecular Epidemiology, Helmholtz Zentrum München, Neuherberg, Germany

[6] Biobank of the Hanover Medical School, Hanover Medical School, Hanover, Germany

[7] Institute of Experimental Genetics, Genome Analysis Center, Helmholtz Zentrum München, Neuherberg, Germany

[8] Lehrstuhl für Experimentelle Genetik, Technische Universität München, Freising-Weihenstephan, Germany

[9] Institute of Epidemiology, Helmholtz Zentrum München, Neuherberg, Germany

[10] Department of Mathematics, Technische Universität München, Garching, Germany

University of Oxford, United Kingdom

Author notes

* E-mail: g.kastenmueller@helmholtz-muenchen.de

AME, MWM, RPM, and MVM are employees of Metabolon. A patent application for the unknown identification method has been filed: “Identity Elucidation of Unknown Metabolites,” U.S. Patent Application No. 61503673, unpublished – filing date July 1, 2011 (MVM, applicant).

Conceived and designed the experiments: JK KS FJT GK. Performed the experiments: AME MWM RPM MVM. Analyzed the data: JK GK. Contributed reagents/materials/analysis tools: BW WR-M TI JA CG. Wrote the paper: JK KS FJT GK.

Article

Publisher ID: PGENETICS-D-12-00784

DOI: 10.1371/journal.pgen.1003005

PMC ID: 3475673

PubMed ID: 23093944

SO-VID: 7b6f70f8-f15f-4ac0-8256-d9eec43291f2

License:

This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

History

Date received: 30 March 2012

Date accepted: 16 August 2012

Page count

Pages: 14

Funding

This work was funded in part by a grant from the German Federal Ministry of Education and Research (BMBF) to the German Center for Diabetes Research (DZD e.V.), by the European Research Council (starting grant “LatentCauses”), by BMBF Grant no. 03IS2061B (project Gani_Med), by BMBF Grant no. 0315494A (project SysMBo), by Era-Net grant no. 0315442A (project PathoGenoMics), and by the Initiative and Networking Fund of the Helmholtz Association within the Helmholtz Alliance on Systems Biology (project CoReNe). JK is supported by a PhD student fellowship from the “Studienstiftung des Deutschen Volkes.” KS is supported by Qatar Foundation. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
