McTwo: a two-step feature selection algorithm based on maximal information coefficient

Ge, Ruiquan; Zhou, Manli; Luo, Youxi; Meng, Qinghan; Mai, Guoqin; Ma, Dongli; Wang, Guoqing; Zhou, Fengfeng

doi:10.1186/s12859-016-0990-0

Author(s): Ruiquan Ge, Manli Zhou, Youxi Luo, Qinghan Meng, Guoqin Mai, Dongli Ma, Guoqing Wang, Fengfeng Zhou

Publication date (Electronic): 23 March 2016

Journal: BMC Bioinformatics

Publisher: BioMed Central

Keywords: Maximal information coefficient (MIC), Heuristic algorithm, Feature selection, Filter algorithm, Wrapper algorithm

Abstract

Background

High-throughput bio-OMIC technologies are producing high-dimensional data from bio-samples at an ever-increasing rate, whereas the number of training samples in a traditional experiment remains small due to various difficulties. This “large p, small n” paradigm in the area of biomedical “big data” may be at least partly solved by feature selection algorithms, which select only the features significantly associated with phenotypes. Feature selection is an NP-hard problem. Because the time needed to find the globally optimal solution grows exponentially with the number of features, existing feature selection algorithms employ heuristic rules to find locally optimal solutions, and their solutions perform differently on different datasets.
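To make the exponential cost concrete, note that an exhaustive search over p features must consider every non-empty feature subset. The count below is a standard combinatorial identity; the value p = 20 is an illustrative assumption and is not taken from the article.

```latex
\[
  \sum_{k=1}^{p} \binom{p}{k} = 2^{p} - 1,
  \qquad \text{e.g. } p = 20 \;\Rightarrow\; 2^{20} - 1 = 1{,}048{,}575 .
\]
```

Each additional feature doubles the search space, which is why heuristic rules are used once p reaches the thousands typical of OMIC data.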

Results

This work describes a feature selection algorithm based on a recently published correlation measurement, the Maximal Information Coefficient (MIC). The proposed algorithm, McTwo, aims to select features that are associated with phenotypes, are independent of each other, and achieve high classification performance with the nearest neighbor algorithm. In a comparative study of 17 datasets, McTwo performs about as well as or better than existing algorithms while selecting significantly fewer features. According to the literature, the features selected by McTwo also appear to have particular biomedical relevance to the phenotypes.
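The abstract does not spell out the two steps, so the following is only a minimal sketch of a generic filter-then-wrapper selection, not the authors' McTwo implementation. It assumes a filter step that scores each feature against the phenotype, followed by a greedy wrapper step that keeps a feature only if it improves leave-one-out accuracy of a 1-nearest-neighbor classifier. A binned mutual information score is swapped in as a crude stand-in for MIC. All function names, the `min_score` threshold, and the toy data are hypothetical.

```python
import math
from collections import Counter


def binned_mutual_information(values, labels, bins=3):
    """Mutual information (bits) between an equal-width-binned feature and labels.

    NOTE: a simple stand-in for MIC, which is not implemented here.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid division by zero for constant features
    binned = [min(int((v - lo) / width), bins - 1) for v in values]
    n = len(values)
    joint, px, py = Counter(zip(binned, labels)), Counter(binned), Counter(labels)
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def loo_1nn_accuracy(rows, labels, feature_idx):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier on chosen features."""
    correct = 0
    for i, row in enumerate(rows):
        best_j, best_d = None, math.inf
        for j, other in enumerate(rows):
            if j == i:
                continue
            d = sum((row[k] - other[k]) ** 2 for k in feature_idx)
            if d < best_d:
                best_j, best_d = j, d
        correct += labels[best_j] == labels[i]
    return correct / len(rows)


def two_step_select(rows, labels, min_score=0.2):
    """Hypothetical two-step selection: score filter, then greedy 1-NN wrapper."""
    n_features = len(rows[0])
    columns = [[r[k] for r in rows] for k in range(n_features)]
    # Step 1 (filter): keep features whose score against the labels passes a threshold.
    scores = {k: binned_mutual_information(columns[k], labels)
              for k in range(n_features)}
    candidates = sorted((k for k in scores if scores[k] >= min_score),
                        key=lambda k: -scores[k])
    # Step 2 (wrapper): add a candidate only if it strictly improves 1-NN accuracy.
    selected, best_acc = [], 0.0
    for k in candidates:
        acc = loo_1nn_accuracy(rows, labels, selected + [k])
        if acc > best_acc:
            selected.append(k)
            best_acc = acc
    return selected, best_acc


# Toy data (made up): feature 0 is informative, 1 is noise, 2 is constant,
# and 3 is a rescaled copy of feature 0 (redundant).
rows = [
    [0.10, 0.5, 1.0, 0.2], [0.20, 0.1, 1.0, 0.4],
    [0.15, 0.9, 1.0, 0.3], [0.30, 0.3, 1.0, 0.6],
    [0.90, 0.5, 1.0, 1.8], [0.80, 0.1, 1.0, 1.6],
    [0.95, 0.9, 1.0, 1.9], [0.70, 0.3, 1.0, 1.4],
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
selected, accuracy = two_step_select(rows, labels)
print(selected, accuracy)
```

In this toy run the noise and constant features are dropped by the filter, and the redundant copy is dropped by the wrapper because it does not improve accuracy. That only indirectly approximates the "independent of each other" goal stated above; an explicit redundancy check between features would be a separate design choice.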

Conclusion

McTwo selects a feature subset that is small and achieves very good classification performance. McTwo may therefore serve as a complementary feature selection algorithm for high-dimensional biomedical datasets.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-016-0990-0) contains supplementary material, which is available to authorized users.

Author and article information

Contributors

Dongli Ma: madl1234@126.com

Guoqing Wang: qing@jlu.edu.cn

Fengfeng Zhou: +86-755-86392200, FengfengZhou@gmail.com, ff.zhou@siat.ac.cn

Journal

Journal ID (nlm-ta): BMC Bioinformatics

Journal ID (iso-abbrev): BMC Bioinformatics

Title: BMC Bioinformatics

Publisher: BioMed Central (London)

ISSN (Electronic): 1471-2105

Publication date (Electronic): 23 March 2016

Publication date PMC-release: 23 March 2016

Publication date Collection: 2016

Volume: 17

Electronic Location Identifier: 142

Affiliations

Shenzhen Institutes of Advanced Technology, and Key Lab for Health Informatics, Chinese Academy of Sciences, 1068 Xueyuan Avenue, Shenzhen University Town, Shenzhen, Guangdong 518055 P.R. China

Shenzhen College of Advanced Technology, University of Chinese Academy of Sciences, Shenzhen, Guangdong 518055 P.R. China

School of Science, Hubei University of Technology, Wuhan, Hubei 430068 P.R. China

Shenzhen Children’s Hospital, Shenzhen, Guangdong 518026 P.R. China

Department of Pathogenobiology, Basic Medical College of Jilin University, Changchun, Jilin, China

Article

Publisher ID: 990

DOI: 10.1186/s12859-016-0990-0

PMC ID: 4804474

PubMed ID: 27006077

SO-VID: 2d23920b-59d5-458d-b663-7deac1c5562a

License:

Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

History

Date received: 2 December 2015

Date accepted: 14 March 2016

Funding

Funded by: Strategic Priority Research Program of the Chinese Academy of Sciences

Award ID: XDB13040400

Award Recipient: Fengfeng Zhou

Funded by: Shenzhen Peacock Plan

Award ID: KQCX20130628112914301

Award ID: KQCX20130628112914291

Award Recipient: Fengfeng Zhou

Funded by: Shenzhen Science and Technology Grants

Award ID: JCYJ20130401114111457

Award Recipient: Dongli Ma

Funded by: China 863 program

Award ID: SS2015AA020109-4

Award Recipient: Fengfeng Zhou

Funded by: Shenzhen Research Grants

Award ID: JCYJ20130401170306884

Award Recipient: Fengfeng Zhou

Funded by: Key Laboratory of Human-Machine-Intelligence Synergic Systems, Chinese Academy of Sciences

Funded by: MOE Humanities Social Sciences Fund

Award ID: 13YJC790105

Award Recipient: Youxi Luo

Funded by: Doctoral Research Fund of HBUT

Award ID: BSQD13050

Award Recipient: Youxi Luo

Custom metadata

ScienceOpen disciplines: Bioinformatics & Computational biology
