Random KNN feature selection - a fast and stable alternative to Random Forests

Abstract

Background

Successfully modeling high-dimensional data involving thousands of variables is challenging. This is especially true for gene expression profiling experiments, given the large number of genes involved and the small number of samples available. Random Forests (RF) is a popular and widely used approach to feature selection for such "small n, large p problems." However, Random Forests suffers from instability, especially in the presence of noisy and/or unbalanced inputs.

Results

We present RKNN-FS, an innovative feature selection procedure for "small n, large p problems." RKNN-FS is based on Random KNN (RKNN), a novel generalization of traditional nearest-neighbor modeling. RKNN consists of an ensemble of base k-nearest neighbor models, each constructed from a random subset of the input variables. To rank the importance of the variables, we define a criterion on the RKNN framework, using the notion of support. A two-stage backward model selection method is then developed based on this criterion. Empirical results on microarray data sets with thousands of variables and relatively few samples show that RKNN-FS is an effective feature selection approach for high-dimensional data. Without feature selection, RKNN is similar to Random Forests in terms of classification accuracy. However, RKNN provides much better classification accuracy than RF when each method incorporates a feature-selection step. Our results show that RKNN is significantly more stable and more robust than Random Forests for feature selection when the input data are noisy and/or unbalanced. Further, RKNN-FS is much faster than the Random Forests feature selection method (RF-FS), especially for large-scale problems involving thousands of variables and multiple classes.
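The ensemble idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`draw_subsets`, `rknn_predict`, `feature_support`), the parameters `r`, `m`, and `k`, and the toy data are hypothetical. The abstract does not define "support", so the sketch assumes one plausible reading: a feature's support is the mean leave-one-out accuracy of the base KNN models that include it. In the toy data, feature 0 separates the two classes, while features 1 to 3 are deliberately misleading (each point's closest match on them belongs to the other class).

```python
import random
from collections import Counter, defaultdict


def knn_predict(train_X, train_y, query, features, k=1):
    """Plain KNN vote for one query, using only the columns in `features`."""
    def sq_dist(row):
        return sum((row[j] - query[j]) ** 2 for j in features)
    nearest = sorted(range(len(train_X)), key=lambda i: sq_dist(train_X[i]))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def draw_subsets(p, r, m, seed=0):
    """Draw r random feature subsets of size m out of p features."""
    rng = random.Random(seed)
    return [tuple(rng.sample(range(p), m)) for _ in range(r)]


def rknn_predict(train_X, train_y, subsets, query, k=1):
    """Majority vote over the base KNN models, one per feature subset."""
    votes = Counter(knn_predict(train_X, train_y, query, s, k) for s in subsets)
    return votes.most_common(1)[0][0]


def feature_support(train_X, train_y, subsets, k=1):
    """ASSUMED criterion: mean leave-one-out accuracy of the base models
    in which a feature participates (not taken from the abstract)."""
    n = len(train_X)
    scores = defaultdict(list)
    for s in subsets:
        correct = 0
        for i in range(n):
            rest_X = train_X[:i] + train_X[i + 1:]
            rest_y = train_y[:i] + train_y[i + 1:]
            if knn_predict(rest_X, rest_y, train_X[i], s, k) == train_y[i]:
                correct += 1
        for j in s:
            scores[j].append(correct / n)
    return {j: sum(a) / len(a) for j, a in scores.items()}


# Hypothetical toy data: 8 samples, 4 features, 2 classes.
f0 = [0, 1, 2, 3, 10, 11, 12, 13]        # informative feature
f1 = [0, 1, 2, 3, 0.1, 1.1, 2.1, 3.1]    # misleading feature
f2 = [3, 2, 1, 0, 3.1, 2.1, 1.1, 0.1]    # misleading feature
f3 = [0, 2, 1, 3, 0.1, 2.1, 1.1, 3.1]    # misleading feature
X = [list(row) for row in zip(f0, f1, f2, f3)]
y = [0, 0, 0, 0, 1, 1, 1, 1]

subsets = draw_subsets(p=4, r=30, m=2, seed=0)
support = feature_support(X, y, subsets, k=1)
ranking = sorted(support, key=support.get, reverse=True)
print("features ranked by assumed support:", ranking)
```

A backward selection step would repeatedly drop the lowest-ranked features and recompute the ranking on the remainder. The two stages the abstract mentions are not specified there, so they are not sketched here.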

Conclusions

Given the superiority of Random KNN in classification performance when compared with Random Forests, RKNN-FS's simplicity and ease of implementation, and its superiority in speed and stability, we propose RKNN-FS as a faster and more stable alternative to Random Forests in classification problems involving feature selection for high-dimensional datasets.

Author and article information

Journal

Journal ID (nlm-ta): BMC Bioinformatics

Title: BMC Bioinformatics

Publisher: BioMed Central

ISSN (Electronic): 1471-2105

Publication date (Collection): 2011

Publication date (Electronic): 18 November 2011

Volume: 12

Page: 450

Affiliations

[1] The Department of Statistics, West Virginia University, Morgantown, WV 26506, USA

[2] Health Effects Laboratory Division, the National Institute for Occupational Safety and Health, Morgantown, WV 26505, USA

[3] The Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV 26506, USA

Article

Publisher ID: 1471-2105-12-450

DOI: 10.1186/1471-2105-12-450

PMC ID: 3281073

PubMed ID: 22093447

SO-VID: 2ffd2f4e-29d6-4a62-8d60-284120a97e67

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
