Data mining in the Life Sciences with Random Forest: a walk in the park or lost in the jungle?

Abstract

In the Life Sciences, ‘omics’ data are increasingly generated by different high-throughput technologies. Often only the integration of these data allows uncovering biological insights that can be experimentally validated or mechanistically modelled; that is, sophisticated computational approaches are required to extract the complex non-linear trends present in omics data. Classification techniques allow training a model based on variables (e.g. SNPs in genetic association studies) to separate different classes (e.g. healthy subjects versus patients). Random Forest (RF) is a versatile classification algorithm suited for the analysis of these large data sets. In the Life Sciences, RF is popular because RF classification models have a high prediction accuracy and provide information on the importance of variables for classification. For omics data, variables or conditional relations between variables are typically important for a subset of samples of the same class. For example, within a class of cancer patients, certain SNP combinations may be important for a subset of patients that have a specific subtype of cancer, but not important for a different subset of patients. These conditional relationships can in principle be uncovered from the data with RF, as they are implicitly taken into account by the algorithm during the creation of the classification model. This review details some RF properties that, to the best of our knowledge, are rarely or never used and that allow maximizing the biological insights that can be extracted from complex omics data sets using RF.
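
To make the two RF outputs named in the abstract concrete (a class prediction and a per-variable importance), the following is a minimal sketch in plain Python. It is not the authors' code and not a full Random Forest: every tree is reduced to a single split (a "stump"), the data are synthetic, and all names (`make_data`, `fit_forest`, `permutation_importance`) and parameter values are illustrative assumptions. Importance is estimated as the drop in out-of-bag accuracy after permuting one variable.

```python
import random

def make_data(n, rng):
    """Synthetic toy data (assumption): variable 0 separates the classes, 1 and 2 are noise."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append([rng.gauss(3.0 * label, 1.0), rng.gauss(0, 1), rng.gauss(0, 1)])
        y.append(label)
    return X, y

def fit_stump(X, y, rows, features):
    """Return the single split (feature, threshold, left label, right label) with fewest errors."""
    best = None
    for f in features:
        for t in sorted({X[r][f] for r in rows}):
            left = [y[r] for r in rows if X[r][f] <= t]
            right = [y[r] for r in rows if X[r][f] > t]
            l_lab = int(sum(left) * 2 > len(left))
            r_lab = int(sum(right) * 2 > len(right)) if right else l_lab
            errors = sum(v != l_lab for v in left) + sum(v != r_lab for v in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    return best[1:]

def predict_stump(stump, x):
    f, t, l_lab, r_lab = stump
    return l_lab if x[f] <= t else r_lab

def fit_forest(X, y, n_trees, mtry, rng):
    """Each tree sees a bootstrap sample and a random subset of mtry variables."""
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]   # bootstrap sample (with replacement)
        oob = sorted(set(range(n)) - set(rows))       # out-of-bag rows, unseen by this tree
        features = rng.sample(range(p), mtry)         # random variable subset
        forest.append((fit_stump(X, y, rows, features), oob))
    return forest

def predict(forest, x):
    """Majority vote over all trees (ties go to class 0)."""
    votes = sum(predict_stump(stump, x) for stump, _ in forest)
    return int(votes * 2 > len(forest))

def permutation_importance(forest, X, y, rng):
    """Mean drop in out-of-bag accuracy per tree after permuting one variable."""
    p = len(X[0])
    importance = [0.0] * p
    for stump, oob in forest:
        base = sum(predict_stump(stump, X[r]) == y[r] for r in oob)
        for f in range(p):
            shuffled = [X[r][f] for r in oob]
            rng.shuffle(shuffled)
            hits = 0
            for r, v in zip(oob, shuffled):
                x = list(X[r])
                x[f] = v
                hits += predict_stump(stump, x) == y[r]
            importance[f] += (base - hits) / len(oob)
    return [v / len(forest) for v in importance]

rng = random.Random(0)
X, y = make_data(120, rng)
forest = fit_forest(X, y, n_trees=30, mtry=2, rng=rng)
accuracy = sum(predict(forest, x) == label for x, label in zip(X, y)) / len(y)
importance = permutation_importance(forest, X, y, rng)
print("training accuracy:", round(accuracy, 2))
print("permutation importance per variable:", [round(v, 3) for v in importance])
```

Because each tree here has only one split, this sketch cannot represent the conditional relations between variables that the abstract describes; capturing those would require deeper trees, in which a split on one variable is followed by a split on another.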

Most cited references 91

Gene selection and classification of microarray data using random forest

Ramón Díaz-Uriarte, Sara Alvarez de Andrés (2006)

Background: Selection of relevant genes for sample classification is a common task in most gene expression studies, where researchers try to identify the smallest possible set of genes that can still achieve good predictive performance (for instance, for future use with diagnostic purposes in clinical practice). Many gene selection approaches use univariate (gene-by-gene) rankings of gene relevance and arbitrary thresholds to select the number of genes, can only be applied to two-class problems, and use gene selection ranking criteria unrelated to the classification algorithm. In contrast, random forest is a classification algorithm well suited for microarray data: it shows excellent performance even when most predictive variables are noise, can be used when the number of variables is much larger than the number of observations and in problems involving more than two classes, and returns measures of variable importance. Thus, it is important to understand the performance of random forest with microarray data and its possible use for gene selection.

Results: We investigate the use of random forest for classification of microarray data (including multi-class problems) and propose a new method of gene selection in classification problems based on random forest. Using simulated and nine microarray data sets we show that random forest has comparable performance to other classification methods, including DLDA, KNN, and SVM, and that the new gene selection procedure yields very small sets of genes (often smaller than alternative methods) while preserving predictive accuracy.

Conclusion: Because of its performance and features, random forest and gene selection using random forest should probably become part of the "standard tool-box" of methods for class prediction and gene selection with microarray data.

A new approach to decoding life: systems biology.

Trey Ideker, Timothy Galitski, Leroy Hood (2001)

Systems biology studies biological systems by systematically perturbing them (biologically, genetically, or chemically); monitoring the gene, protein, and informational pathway responses; integrating these data; and ultimately, formulating mathematical models that describe the structure of the system and its response to individual perturbations. The emergence of systems biology is described, as are several examples of specific systems approaches.

Metabonomics: a platform for studying drug toxicity and gene function.

Jeremy Nicholson, John C Connelly, John C. Lindon … (2002)

The later that a molecule or molecular class is lost from the drug development pipeline, the higher the financial cost. Minimizing attrition is therefore one of the most important aims of a pharmaceutical discovery programme. Novel technologies that increase the probability of making the right choice early save resources, and promote safety, efficacy and profitability. Metabonomics is a systems approach for studying in vivo metabolic profiles, which promises to provide information on drug toxicity, disease processes and gene function at several stages in the discovery-and-development process.

Author and article information

Journal

Journal ID (nlm-ta): Brief Bioinform

Journal ID (iso-abbrev): Brief. Bioinformatics

Journal ID (publisher-id): bib

Journal ID (hwp): bib

Title: Briefings in Bioinformatics

Publisher: Oxford University Press

ISSN (Print): 1467-5463

ISSN (Electronic): 1477-4054

Publication date (Print): May 2013

Publication date (Electronic): 10 July 2012

Publication date PMC-release: 10 July 2012

Volume: 14

Issue: 3

Pages: 315-326

Author notes

Corresponding author. Sacha A. F. T. van Hijum. E-mail: svhijum@cmbi.ru.nl

Article

Publisher ID: bbs034

DOI: 10.1093/bib/bbs034

PMC ID: 3659301

PubMed ID: 22786785

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

History

Date received: 30 March 2012

Date accepted: 26 May 2012

Page count

Pages: 12
