135
views
0
recommends
+1 Recommend
0 collections
    2
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      Conditional variable importance for random forests

      research-article

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          Background

          Random forests are becoming increasingly popular in many scientific fields because they can cope with "small n large p" problems, complex interactions and even highly correlated predictor variables. Their variable importance measures have recently been suggested as screening tools for, e.g., gene expression studies. However, these variable importance measures show a bias towards correlated predictor variables.

          Results

          We identify two mechanisms responsible for this finding: (i) A preference for the selection of correlated predictors in the tree building process and (ii) an additional advantage for correlated predictor variables induced by the unconditional permutation scheme that is employed in the computation of the variable importance measure. Based on these considerations we develop a new, conditional permutation scheme for the computation of the variable importance measure.

          Conclusion

          The resulting conditional variable importance reflects the true impact of each predictor variable more reliably than the original marginal approach.

          Related collections

          Most cited references31

          • Record: found
          • Abstract: found
          • Article: not found

          Random forest: a classification and regression tool for compound classification and QSAR modeling.

          A new classification and regression tool, Random Forest, is introduced and investigated for predicting a compound's quantitative or categorical biological activity based on a quantitative description of the compound's molecular structure. Random Forest is an ensemble of unpruned classification or regression trees created by using bootstrap samples of the training data and random feature selection in tree induction. Prediction is made by aggregating (majority vote or averaging) the predictions of the ensemble. We built predictive models for six cheminformatics data sets. Our analysis demonstrates that Random Forest is a powerful tool capable of delivering performance that is among the most accurate methods to date. We also present three additional features of Random Forest: built-in performance assessment, a measure of relative importance of descriptors, and a measure of compound similarity that is weighted by the relative importance of descriptors. It is the combination of relatively high prediction accuracy and its collection of desired features that makes Random Forest uniquely suited for modeling in cheminformatics.
            Bookmark
            • Record: found
            • Abstract: found
            • Article: found
            Is Open Access

            Gene selection and classification of microarray data using random forest

            Background Selection of relevant genes for sample classification is a common task in most gene expression studies, where researchers try to identify the smallest possible set of genes that can still achieve good predictive performance (for instance, for future use with diagnostic purposes in clinical practice). Many gene selection approaches use univariate (gene-by-gene) rankings of gene relevance and arbitrary thresholds to select the number of genes, can only be applied to two-class problems, and use gene selection ranking criteria unrelated to the classification algorithm. In contrast, random forest is a classification algorithm well suited for microarray data: it shows excellent performance even when most predictive variables are noise, can be used when the number of variables is much larger than the number of observations and in problems involving more than two classes, and returns measures of variable importance. Thus, it is important to understand the performance of random forest with microarray data and its possible use for gene selection. Results We investigate the use of random forest for classification of microarray data (including multi-class problems) and propose a new method of gene selection in classification problems based on random forest. Using simulated and nine microarray data sets we show that random forest has comparable performance to other classification methods, including DLDA, KNN, and SVM, and that the new gene selection procedure yields very small sets of genes (often smaller than alternative methods) while preserving predictive accuracy. Conclusion Because of its performance and features, random forest and gene selection using random forest should probably become part of the "standard tool-box" of methods for class prediction and gene selection with microarray data.
              Bookmark
              • Record: found
              • Abstract: not found
              • Article: not found

              R: a language and environment for statistic computing

                Bookmark

                Author and article information

                Journal
                BMC Bioinformatics
                BMC Bioinformatics
                BioMed Central
                1471-2105
                2008
                11 July 2008
                : 9
                : 307
                Affiliations
                [1 ]Department of Statistics, Ludwig-Maximilians-Universität Munchen, Ludwigstraße 33, D-80539 München, Germany
                [2 ]Sylvia Lawry Centre for Multiple Sclerosis Research, Hohenlindener Straße 1, D-81677 München, Germany
                [3 ]Department of Statistics and Mathematics, Wirtschaftsuniversität Wien, Augasse 2 – 6, A-1090 Wien, Austria
                Article
                1471-2105-9-307
                10.1186/1471-2105-9-307
                2491635
                18620558
                e9ceb6ad-11e5-450f-9f35-649d5305d1fd
                Copyright © 2008 Strobl et al; licensee BioMed Central Ltd.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

                History
                : 1 April 2008
                : 11 July 2008
                Categories
                Methodology Article

                Bioinformatics & Computational biology
                Bioinformatics & Computational biology

                Comments

                Comment on this article