SVM-RFE: selection and visualization of the most relevant features through non-linear kernels

Abstract

Background

Support vector machines (SVM) are a powerful tool for analyzing data in which the number of predictors is approximately equal to or larger than the number of observations. Originally, however, the application of SVM to biomedical data was limited because SVM was not designed to evaluate the importance of predictor variables. Creating prediction models based on only the most relevant variables is essential in biomedical research. Substantial work has since been done to allow assessment of variable importance in SVM models, but this work has focused on SVM implemented with linear kernels. The power of SVM as a prediction model is associated with the flexibility generated by the use of non-linear kernels. Moreover, SVM has been extended to model survival outcomes. This paper extends the Recursive Feature Elimination (RFE) algorithm by proposing three approaches to rank variables based on non-linear SVM and SVM for survival analysis.
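
For orientation, the classical linear-kernel RFE that serves as the starting point here can be sketched as follows. This is a minimal illustration, not the authors' implementation and not any of the three proposed non-linear methods: it trains a linear SVM, ranks features by the squared weight w_j^2 (the criterion of the classical RFE of Guyon), removes the lowest-ranked feature, and repeats. The function names (train_linear_svm, svm_rfe), the toy data, the hyperparameters, and the plain subgradient-descent solver standing in for a real SVM library are all assumptions made for this sketch.

```python
def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Full-batch subgradient descent on the L2-regularised hinge loss.

    Assumption: a simple stand-in for a real SVM solver, kept
    dependency-free so the sketch is self-contained.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regulariser
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # only margin violators contribute to the hinge term
                for j in range(p):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def svm_rfe(X, y):
    """Return feature indices ordered from most to least relevant."""
    remaining = list(range(len(X[0])))
    eliminated = []
    while remaining:
        # Refit the model on the features that are still in play.
        X_sub = [[row[j] for j in remaining] for row in X]
        w, _ = train_linear_svm(X_sub, y)
        # Linear-kernel ranking criterion: squared weight of each feature.
        scores = [wj * wj for wj in w]
        worst = min(range(len(remaining)), key=lambda k: scores[k])
        eliminated.append(remaining.pop(worst))
    # The last feature eliminated is the most relevant one.
    return eliminated[::-1]


# Hypothetical toy data: column 0 is noise unrelated to the label,
# column 1 separates the classes strongly, column 2 only weakly.
X = [
    [0.3, 2.0, 0.5],
    [-0.3, 1.5, 0.2],
    [0.3, 2.5, 0.4],
    [-0.3, 1.8, 0.1],
    [0.3, -2.0, -0.5],
    [-0.3, -1.5, -0.2],
    [0.3, -2.5, -0.4],
    [-0.3, -1.8, -0.1],
]
y = [1, 1, 1, 1, -1, -1, -1, -1]

print(svm_rfe(X, y))  # → [1, 2, 0]
```

On this toy data the noise column is eliminated first and the strongly separating column survives longest. Removing one feature per iteration is the simplest choice; removing several at a time trades ranking resolution for speed. With a non-linear kernel there is no explicit weight vector to square, which is why a different ranking criterion is needed in that setting.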

Results

The proposed algorithms allow visualization of each of the RFE iterations and, hence, identification of the most relevant predictors of the response variable. Using simulation studies based on time-to-event outcomes and three real datasets, we evaluate the three methods, which are based on pseudo-samples and kernel principal component analysis, and compare them with the original SVM-RFE algorithm for non-linear kernels. In the simulation studies, where the truly most relevant variables can be compared with the variable ranks produced by each algorithm, the three proposed algorithms generally performed better than the gold-standard RFE for non-linear kernels. The RFE-pseudo-samples approach generally outperformed the other three methods in all tested scenarios, even when variables were assumed to be correlated.

Conclusions

The proposed approaches, particularly the RFE-pseudo-samples approach, can be implemented with accuracy to select variables and to assess the direction and strength of associations between predictors and outcomes when analyzing biomedical data using SVM for categorical or time-to-event responses. These approaches perform better than the classical RFE of Guyon under realistic scenarios for the structure of biomedical data.

Electronic supplementary material

The online version of this article (10.1186/s12859-018-2451-4) contains supplementary material, which is available to authorized users.

Author and article information

Contributors

Hector Sanz: hsrodenas@gmail.com (ORCID: http://orcid.org/0000-0001-6540-8427)

Clarissa Valim: cvalim@hsph.harvard.edu

Esteban Vegas: evegas@ub.edu

Josep M. Oller: joller@ub.edu

Ferran Reverter: freverter@ub.edu

Journal

Journal ID (nlm-ta): BMC Bioinformatics

Journal ID (iso-abbrev): BMC Bioinformatics

Title: BMC Bioinformatics

Publisher: BioMed Central (London)

ISSN (Electronic): 1471-2105

Publication date (Electronic): 19 November 2018

Publication date PMC-release: 19 November 2018

Publication date Collection: 2018

Volume: 19

Electronic Location Identifier: 432

Affiliations

[1] ISNI 0000 0004 1937 0247, GRID grid.5841.8, Department of Genetics, Microbiology and Statistics, Faculty of Biology, Universitat de Barcelona, Diagonal, 643, 08028 Barcelona, Catalonia, Spain

[2] ISNI 0000 0001 2150 1785, GRID grid.17088.36, Department of Osteopathic Medical Specialties, Michigan State University, 909 Fee Road, Room B 309 West Fee Hall, East Lansing, MI 48824, USA

[3] Department of Immunology and Infectious Diseases, Harvard T.H. Chan School of Public Health, 675 Huntington Ave, Boston, MA 02115, USA

[4] GRID grid.473715.3, Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology, Dr. Aiguader 88, 08003 Barcelona, Spain

Article

Publisher ID: 2451

DOI: 10.1186/s12859-018-2451-4

PMC ID: 6245920

PubMed ID: 30453885

SO-VID: 3bc6ff8e-8228-42fd-8336-3ccc31f7b536

License:

Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

History

Date received: 7 May 2018

Date accepted: 30 October 2018

Funding

Funded by: MINECO/FEDER

Award ID: MTM2015-64465-C2-1-R

Award Recipient: Esteban Vegas, Josep M. Oller

Custom metadata

ScienceOpen disciplines: Bioinformatics & Computational biology

Keywords: support vector machines, relevant variables, recursive feature elimination, kernel methods
