      McTwo: a two-step feature selection algorithm based on maximal information coefficient


          Abstract

          Background

          High-throughput bio-OMIC technologies are producing high-dimensional data from bio-samples at an ever-increasing rate, whereas the number of training samples in a traditional experiment remains small due to various difficulties. This “large p, small n” paradigm in the area of biomedical “big data” may be at least partly solved by feature selection algorithms, which select only features significantly associated with phenotypes. Feature selection is an NP-hard problem. Because the time needed to find the globally optimal solution grows exponentially with the number of features, all existing feature selection algorithms employ heuristic rules to find locally optimal solutions, and their solutions perform differently on different datasets.
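To make the cost argument above concrete, the following is a minimal sketch, not taken from the article, that contrasts exhaustive subset search with a greedy forward heuristic. The function names and the toy score table are hypothetical and exist only for illustration; the scores are chosen so that the greedy search settles on a locally optimal subset.

```python
from itertools import combinations


def count_nonempty_subsets(p):
    """Number of candidate feature subsets for p features (2^p - 1)."""
    return 2 ** p - 1


def exhaustive_search(features, score):
    """Score every non-empty subset: finds the global optimum at exponential cost."""
    best, best_score = None, float("-inf")
    for k in range(1, len(features) + 1):
        for subset in combinations(features, k):
            s = score(subset)
            if s > best_score:
                best, best_score = subset, s
    return best, best_score


def greedy_forward(features, score):
    """Add the single best feature per round; stop when the score no longer improves."""
    selected, best_score = [], float("-inf")
    remaining = list(features)
    while remaining:
        cand_score, cand = max((score(selected + [f]), f) for f in remaining)
        if cand_score <= best_score:
            break
        selected.append(cand)
        remaining.remove(cand)
        best_score = cand_score
    return tuple(selected), best_score


# Hypothetical scores: "b" and "c" are weak alone but strong together.
TOY_SCORES = {
    frozenset("a"): 0.6, frozenset("b"): 0.5, frozenset("c"): 0.5,
    frozenset("ab"): 0.65, frozenset("ac"): 0.65, frozenset("bc"): 0.9,
    frozenset("abc"): 0.7,
}


def toy_score(subset):
    return TOY_SCORES[frozenset(subset)]
```

On this toy table the exhaustive search returns the pair `("b", "c")`, while the greedy search starts from `"a"` and never reaches it. With 20 features there are already more than a million subsets to score, which is why heuristics are used in practice.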

          Results

          This work describes a feature selection algorithm based on a recently published correlation measurement, the Maximal Information Coefficient (MIC). The proposed algorithm, McTwo, aims to select features that are associated with phenotypes, are independent of each other, and achieve high classification performance with the nearest neighbor algorithm. In a comparative study of 17 datasets, McTwo performs about as well as or better than existing algorithms while selecting significantly fewer features. The features selected by McTwo also appear to have particular biomedical relevance to the phenotypes, as reported in the literature.
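The three goals named above (association with the phenotype, mutual independence, and nearest-neighbor performance) can be sketched as a two-step procedure. This is an illustration under stated assumptions, not the authors' implementation: absolute Pearson correlation is swapped in for MIC (so non-linear associations are missed), the threshold of 0.3 is arbitrary, plain leave-one-out accuracy is used as the wrapper score, and all function names and the toy data are hypothetical.

```python
from math import sqrt


def abs_pearson(x, y):
    """Absolute Pearson correlation; a stand-in for MIC in this sketch."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / sqrt(sxx * syy)


def filter_step(columns, labels, threshold=0.3):
    """Step 1: keep label-associated features, then drop redundant ones."""
    relevance = {j: abs_pearson(col, labels) for j, col in enumerate(columns)}
    ranked = sorted((j for j in relevance if relevance[j] >= threshold),
                    key=lambda j: -relevance[j])
    kept = []
    for j in ranked:
        # Redundant if tied more strongly to a kept feature than to the label.
        if all(abs_pearson(columns[j], columns[k]) < relevance[j] for k in kept):
            kept.append(j)
    return kept


def loo_1nn_accuracy(columns, labels, subset):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier on `subset`."""
    n = len(labels)
    correct = 0
    for i in range(n):
        nearest = min((k for k in range(n) if k != i),
                      key=lambda k: sum((columns[j][i] - columns[j][k]) ** 2
                                        for j in subset))
        correct += labels[nearest] == labels[i]
    return correct / n


def wrapper_step(columns, labels, candidates):
    """Step 2: greedily add the candidate that most improves 1-NN accuracy."""
    selected, best = [], 0.0
    improved = True
    while improved:
        improved = False
        for j in candidates:
            if j in selected:
                continue
            acc = loo_1nn_accuracy(columns, labels, selected + [j])
            if acc > best:
                best, pick, improved = acc, j, True
        if improved:
            selected.append(pick)
    return selected, best


# Toy data: feature 0 is informative, feature 1 duplicates it, feature 2 is noise.
LABELS = [0, 0, 0, 1, 1, 1]
F0 = [1.0, 1.2, 0.9, 3.0, 3.1, 2.9]
COLUMNS = [F0, [2 * v for v in F0], [5.0, 1.0, 4.0, 2.0, 5.0, 1.0]]
KEPT = filter_step(COLUMNS, LABELS)
SELECTED, ACCURACY = wrapper_step(COLUMNS, LABELS, KEPT)
```

On the toy data the filter step discards the noise feature and the duplicated feature, and the wrapper step then needs only one feature to classify every sample correctly.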

          Conclusion

          McTwo selects a small feature subset that achieves very good classification performance. McTwo may therefore serve as a complementary feature selection algorithm for high-dimensional biomedical datasets.

          Electronic supplementary material

          The online version of this article (doi:10.1186/s12859-016-0990-0) contains supplementary material, which is available to authorized users.

          Most cited references (20)


          Bias in error estimation when using cross-validation for model selection

          Background: Cross-validation (CV) is an effective method for estimating the prediction error of a classifier. Some recent articles have proposed methods for optimizing classifiers by choosing classifier parameter values that minimize the CV error estimate. We have evaluated the validity of using the CV error estimate of the optimized classifier as an estimate of the true error expected on independent data.

          Results: We used CV to optimize the classification parameters for two kinds of classifiers; Shrunken Centroids and Support Vector Machines (SVM). Random training datasets were created, with no difference in the distribution of the features between the two classes. Using these "null" datasets, we selected classifier parameter values that minimized the CV error estimate. 10-fold CV was used for Shrunken Centroids while Leave-One-Out-CV (LOOCV) was used for the SVM. Independent test data was created to estimate the true error. With "null" and "non null" (with differential expression between the classes) data, we also tested a nested CV procedure, where an inner CV loop is used to perform the tuning of the parameters while an outer CV is used to compute an estimate of the error. The CV error estimate for the classifier with the optimal parameters was found to be a substantially biased estimate of the true error that the classifier would incur on independent data. Even though there is no real difference between the two classes for the "null" datasets, the CV error estimate for the Shrunken Centroid with the optimal parameters was less than 30% on 18.5% of simulated training data-sets. For SVM with optimal parameters the estimated error rate was less than 30% on 38% of "null" data-sets. Performance of the optimized classifiers on the independent test set was no better than chance. The nested CV procedure reduces the bias considerably and gives an estimate of the error that is very close to that obtained on the independent testing set for both Shrunken Centroids and SVM classifiers for "null" and "non-null" data distributions.

          Conclusion: We show that using CV to compute an error estimate for a classifier that has itself been tuned using CV gives a significantly biased estimate of the true error. Proper use of CV for estimating true error of a classifier developed using a well defined algorithm requires that all steps of the algorithm, including classifier parameter tuning, be repeated in each CV loop. A nested CV procedure provides an almost unbiased estimate of the true error.
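The nested procedure described above can be sketched in a few lines. This is a minimal illustration, not the cited study's code: a one-dimensional k-nearest-neighbor classifier replaces Shrunken Centroids and SVM, the tuned parameter is k, and the function names, fold scheme, and toy data are all assumptions made for the example.

```python
import random


def knn_predict(train, x, k):
    """Majority vote of the k nearest training samples; samples are (value, label)."""
    neighbours = sorted(train, key=lambda s: abs(s[0] - x))[:k]
    votes = sum(label for _, label in neighbours)
    return 1 if 2 * votes > k else 0


def error_rate(train, test, k):
    return sum(knn_predict(train, x, k) != y for x, y in test) / len(test)


def split_folds(data, n_folds):
    """Interleaved folds: sample i goes to fold i mod n_folds."""
    return [data[i::n_folds] for i in range(n_folds)]


def cv_error(data, k, n_folds):
    """Plain CV error for a fixed k."""
    parts = split_folds(data, n_folds)
    errors = []
    for i, test in enumerate(parts):
        train = [s for j, part in enumerate(parts) if j != i for s in part]
        errors.append(error_rate(train, test, k))
    return sum(errors) / len(errors)


def tune_k(data, ks, n_folds):
    """Pick the k with the lowest CV error (the step that causes the bias)."""
    return min(ks, key=lambda k: cv_error(data, k, n_folds))


def nested_cv_error(data, ks, outer_folds, inner_folds):
    """Outer loop estimates error; inner loop tunes k on the outer training part only."""
    parts = split_folds(data, outer_folds)
    errors = []
    for i, test in enumerate(parts):
        train = [s for j, part in enumerate(parts) if j != i for s in part]
        best_k = tune_k(train, ks, inner_folds)
        errors.append(error_rate(train, test, best_k))
    return sum(errors) / len(errors)


# Well-separated toy data, plus "null" data whose labels carry no signal.
SEPARABLE = [(0.1 * i, 0) for i in range(10)] + [(5 + 0.1 * i, 1) for i in range(10)]
rng = random.Random(0)
NULL = [(rng.gauss(0, 1), i % 2) for i in range(40)]

KS = [1, 3, 5, 7]
optimistic = min(cv_error(NULL, k, 5) for k in KS)  # tuned and scored on the same folds
nested = nested_cv_error(NULL, KS, 5, 4)            # tuning kept inside each outer fold
```

The value `optimistic` reuses the same folds for tuning and for reporting, which is the practice the abstract warns against; `nested` keeps the outer test fold out of the tuning step. On a single small sample the two numbers can fall either way, so the comparison is best repeated over many simulated datasets.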

            Statistical comparison of classifiers over multiple data sets


              Selection bias in gene extraction on the basis of microarray gene-expression data.

              In the context of cancer diagnosis and treatment, we consider the problem of constructing an accurate prediction rule on the basis of a relatively small number of tumor tissue samples of known type containing the expression data on very many (possibly thousands) genes. Recently, results have been presented in the literature suggesting that it is possible to construct a prediction rule from only a few genes such that it has a negligible prediction error rate. However, in these results the test error or the leave-one-out cross-validated error is calculated without allowance for the selection bias. There is no allowance because the rule is either tested on tissue samples that were used in the first instance to select the genes being used in the rule or because the cross-validation of the rule is not external to the selection process; that is, gene selection is not performed in training the rule at each stage of the cross-validation process. We describe how in practice the selection bias can be assessed and corrected for by either performing a cross-validation or applying the bootstrap external to the selection process. We recommend using 10-fold rather than leave-one-out cross-validation, and concerning the bootstrap, we suggest using the so-called .632+ bootstrap error estimate designed to handle overfitted prediction rules. Using two published data sets, we demonstrate that when correction is made for the selection bias, the cross-validated error is no longer zero for a subset of only a few genes.
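The difference between selecting genes inside and outside the cross-validation loop can be illustrated as follows. This is a sketch under assumptions, not the cited authors' procedure: features are ranked by the absolute difference of class means, a 1-nearest-neighbor rule is the classifier, and the function names, fold scheme, and simulated data are hypothetical.

```python
import random


def top_features(samples, m):
    """Indices of the m features with the largest absolute gap between class means."""
    n_features = len(samples[0][0])

    def gap(j):
        a = [x[j] for x, y in samples if y == 0]
        b = [x[j] for x, y in samples if y == 1]
        return abs(sum(a) / len(a) - sum(b) / len(b))

    return sorted(range(n_features), key=gap, reverse=True)[:m]


def nn_predict(train, x, feats):
    """Label of the nearest training sample, using only the chosen features."""
    nearest = min(train, key=lambda s: sum((s[0][j] - x[j]) ** 2 for j in feats))
    return nearest[1]


def cv_error(samples, m, n_folds, select_inside):
    """CV error with feature selection either inside each fold or done once up front."""
    # Selecting on all samples lets the test folds influence the chosen features.
    global_feats = top_features(samples, m)
    parts = [samples[i::n_folds] for i in range(n_folds)]
    wrong = 0
    for i, test in enumerate(parts):
        train = [s for j, part in enumerate(parts) if j != i for s in part]
        feats = top_features(train, m) if select_inside else global_feats
        wrong += sum(nn_predict(train, x, feats) != y for x, y in test)
    return wrong / len(samples)


# One truly informative feature among three.
INFORMATIVE = [([10.0 * (i % 2) + 0.01 * i, 0.0, 0.0], i % 2) for i in range(20)]

# Null data: 200 random features, labels unrelated to any of them.
rng = random.Random(0)
NULL = [([rng.gauss(0, 1) for _ in range(200)], i % 2) for i in range(20)]

biased = cv_error(NULL, 5, 5, select_inside=False)    # selection external step skipped
unbiased = cv_error(NULL, 5, 5, select_inside=True)   # selection repeated per fold
```

With `select_inside=False` the features are chosen using samples that later serve as test cases, so `biased` tends to understate the error on null data; with `select_inside=True` the selection is external to each test fold, as the abstract recommends. A single run is noisy, so the gap is clearer when averaged over many simulated datasets.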

                Author and article information

                Contributors
                madl1234@126.com
                qing@jlu.edu.cn
                +86-755-86392200, FengfengZhou@gmail.com, ff.zhou@siat.ac.cn
                Journal
                BMC Bioinformatics
                BioMed Central (London)
                1471-2105
                23 March 2016
                2016; 17: 142
                Affiliations
                Shenzhen Institutes of Advanced Technology, and Key Lab for Health Informatics, Chinese Academy of Sciences, 1068 Xueyuan Avenue, Shenzhen University Town, Shenzhen, Guangdong 518055 P.R. China
                Shenzhen College of Advanced Technology, University of Chinese Academy of Sciences, Shenzhen, Guangdong 518055 P.R. China
                School of Science, Hubei University of Technology, Wuhan, Hubei 430068 P.R. China
                Shenzhen Children’s Hospital, Shenzhen, Guangdong 518026 P.R. China
                Department of Pathogenobiology, Basic Medical College of Jilin University, Changchun, Jilin China
                Article
                990
                10.1186/s12859-016-0990-0
                4804474
                27006077
                © Ge et al. 2016

                Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

                History
                Received: 2 December 2015
                Accepted: 14 March 2016
                Funding
                Funded by: Strategic Priority Research Program of the Chinese Academy of Sciences
                Award ID: XDB13040400
                Funded by: Shenzhen Peacock Plan
                Award ID: KQCX20130628112914301
                Award ID: KQCX20130628112914291
                Funded by: Shenzhen Science and Technology Grants
                Award ID: JCYJ20130401114111457
                Funded by: China 863 program
                Award ID: SS2015AA020109-4
                Funded by: Shenzhen Research Grants
                Award ID: JCYJ20130401170306884
                Funded by: Key Laboratory of Human-Machine-Intelligence Synergic Systems, Chinese Academy of Sciences
                Funded by: MOE Humanities Social Sciences Fund
                Award ID: 13YJC790105
                Funded by: Doctoral Research Fund of HBUT
                Award ID: BSQD13050
                Categories
                Methodology Article
                Custom metadata
                © The Author(s) 2016

                Bioinformatics & Computational biology
                maximal information coefficient (MIC), heuristic algorithm, feature selection, filter algorithm, wrapper algorithm
