
      Stable feature selection based on the ensemble L1-norm support vector machine for biomarker discovery

      research-article
      BMC Genomics
      BioMed Central
      15th International Conference On Bioinformatics (InCOB 2016)
      21-23 September 2016


          Abstract

          Background

          Biomarker discovery has recently become one of the most significant research issues in the biomedical field. Owing to high-throughput technologies, genomic data, such as microarray and RNA-seq data, have become widely available. Many kinds of feature selection techniques have been applied to retrieve significant biomarkers from these data. However, such data tend to be noisy and high-dimensional while containing only a small number of samples; thus, conventional feature selection approaches might be problematic in terms of reproducibility.

          Results

          In this article, we propose a stable feature selection method for high-dimensional datasets. We apply an ensemble L1-norm support vector machine to efficiently remove irrelevant features while taking the stability of features into account. We define a stability score for each feature by aggregating the ensemble results, and then apply backward feature elimination to the purified feature set based on this score; it is therefore possible to acquire an optimal set of features for performance without setting a specific threshold. The proposed methodology is evaluated by classifying the binary stage of renal clear cell carcinoma with RNA-seq data.
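The aggregation and elimination steps described above can be illustrated with a minimal sketch. The abstract does not give the exact definition of the stability score, so this sketch assumes one simple choice: the fraction of ensemble members that assign a feature a non-zero weight. The function names `stability_scores` and `backward_elimination`, the tolerance `tol`, and the toy `evaluate` function are all hypothetical. In practice the weight vectors might come from L1-penalized linear SVMs fitted on bootstrap resamples (for example scikit-learn's `LinearSVC(penalty="l1", dual=False)`), and `evaluate` would be a cross-validated classification score.

```python
from typing import Callable, List, Sequence, Tuple


def stability_scores(weight_vectors: Sequence[Sequence[float]],
                     tol: float = 1e-8) -> List[float]:
    """Assumed score: fraction of ensemble members giving a feature a non-zero weight."""
    n_models = len(weight_vectors)
    n_features = len(weight_vectors[0])
    return [
        sum(1 for w in weight_vectors if abs(w[j]) > tol) / n_models
        for j in range(n_features)
    ]


def backward_elimination(scores: Sequence[float],
                         evaluate: Callable[[List[int]], float]
                         ) -> Tuple[List[int], float]:
    """Drop the least stable feature one at a time and keep the best-scoring subset."""
    # "Purified" set: features chosen by at least one ensemble member,
    # ordered from most to least stable.
    remaining = sorted(
        (j for j, s in enumerate(scores) if s > 0),
        key=lambda j: scores[j],
        reverse=True,
    )
    best_subset, best_perf = list(remaining), evaluate(remaining)
    while len(remaining) > 1:
        remaining = remaining[:-1]  # remove the currently least stable feature
        perf = evaluate(remaining)
        if perf >= best_perf:  # on ties, prefer the smaller subset
            best_subset, best_perf = list(remaining), perf
    return best_subset, best_perf


# Toy ensemble: three hypothetical sparse weight vectors over four features.
weights = [
    [0.5, 0.0, -0.2, 0.0],
    [0.4, 0.0, 0.0, 0.0],
    [0.6, 0.1, -0.3, 0.0],
]
scores = stability_scores(weights)


def evaluate(subset: List[int]) -> float:
    """Toy stand-in for classifier performance: features 0 and 2 are informative."""
    informative = {0, 2}
    return len(informative & set(subset)) - 0.1 * len(subset)


best_subset, best_perf = backward_elimination(scores, evaluate)
print(scores, best_subset)
```

Because the subset size is chosen by the best evaluation result during elimination, no score threshold has to be fixed in advance, which matches the motivation given in the abstract.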

          Conclusion

          A comparison with established algorithms, i.e., a fast correlation-based filter, random forest, and an ensemble version of an L2-norm support vector machine-based recursive feature elimination, showed that our method generally performs better in terms of both classification and stability. It is also shown that the proposed approach performs moderately on high-dimensional datasets consisting of a very large number of features and a smaller number of samples. The proposed approach is expected to be applicable to many other studies aimed at biomarker discovery.


                Author and article information

                Contributors
                moon@hgc.jp
                knakai@ims.u-tokyo.ac.jp
                Conference
                BMC Genomics
                BioMed Central (London)
                ISSN: 1471-2164
                22 December 2016
                Year: 2016
                Volume: 17
                Issue: Suppl 13
                Issue sponsor: Publication of this supplement has not been supported by sponsorship. Information about the source of funding for publication charges can be found in the individual articles. The articles have undergone the journal's standard peer review process for supplements. The Supplement Editors declare that they have no competing interests.
                Article number: 1026
                Affiliations
                [1] ISNI 0000 0001 2151 536X, GRID grid.26999.3d, Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa-shi, Chiba-ken 277-8562, Japan
                [2] ISNI 0000 0001 2151 536X, GRID grid.26999.3d, Human Genome Center, The Institute of Medical Science, The University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo 108-8639, Japan
                Article
                Article ID: 3320
                DOI: 10.1186/s12864-016-3320-z
                PMCID: PMC5260053
                PMID: 28155664
                © The Author(s). 2016

                Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

                Conference location: Queenstown, Singapore
                Categories
                Research
                Genetics