Nonparametric Variable Selection, Clustering and Prediction for High-Dimensional Regression

Abstract

Developing parsimonious models for reliable inference and prediction of responses in high-dimensional regression settings is often challenging because sample sizes are relatively small and a large number of covariates interact in complex patterns. We propose VariScan, an efficient, nonparametric framework for simultaneous variable selection, clustering and prediction in high-throughput regression settings with continuous or discrete outcomes. The VariScan model uses the sparsity induced by Poisson-Dirichlet processes (PDPs) to group the covariates into lower-dimensional latent clusters, each consisting of covariates with similar patterns among the samples. The data are permitted to direct the choice of a suitable cluster allocation scheme, choosing between PDPs and their special case, the Dirichlet process. Subsequently, the latent clusters are used to build a nonlinear prediction model for the responses using an adaptive mixture of linear and nonlinear elements, thus achieving a balance between model parsimony and flexibility. We investigate theoretical properties of the VariScan procedure that differentiate the allocation patterns of PDPs and Dirichlet processes in terms of both the number and the relative sizes of their clusters. Additional theoretical results guarantee the high accuracy of the model-based clustering procedure, and establish model selection and prediction consistency. Through simulation studies and analyses of benchmark data sets, we demonstrate the reliability of VariScan's clustering mechanism and show that the technique compares favorably to, and often outperforms, existing methodologies in terms of the prediction accuracy of the subject-specific responses.
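
The contrast the abstract draws between PDP and Dirichlet process allocations can be illustrated with the standard generalized Chinese restaurant process, a well-known sequential representation of PDP partitions. The sketch below is not the VariScan procedure itself; the function name `sample_pdp_partition` and the parameter names `discount` and `concentration` are assumptions made for illustration. Setting `discount` to 0 gives the Dirichlet process special case.

```python
import random


def sample_pdp_partition(n_items, discount, concentration, seed=0):
    """Sample cluster sizes from a generalized Chinese restaurant process.

    Hypothetical helper for illustration only (not from the article).
    discount = 0 corresponds to a Dirichlet process; 0 < discount < 1
    corresponds to a Poisson-Dirichlet process with that discount.
    """
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must lie in [0, 1)")
    if concentration <= -discount:
        raise ValueError("concentration must exceed -discount")

    rng = random.Random(seed)
    sizes = []  # sizes[j] is the number of items in cluster j
    for i in range(n_items):
        # With i items already allocated, the unnormalized weights are
        #   new cluster:        concentration + discount * (number of clusters)
        #   existing cluster j: sizes[j] - discount
        # and they sum to i + concentration.
        new_weight = concentration + discount * len(sizes)
        u = rng.random() * (i + concentration)
        if not sizes or u < new_weight:
            sizes.append(1)
            continue
        u -= new_weight
        for j, size in enumerate(sizes):
            u -= size - discount
            if u < 0:
                sizes[j] += 1
                break
        else:
            # Guard against floating-point round-off at the upper boundary.
            sizes[-1] += 1
    return sizes


if __name__ == "__main__":
    for d in (0.0, 0.5):
        sizes = sample_pdp_partition(2000, discount=d, concentration=1.0, seed=1)
        print(f"discount={d}: {len(sizes)} clusters, largest has {max(sizes)} items")
```

With a positive discount the number of clusters typically grows like a power of the number of items, whereas in the Dirichlet process case it grows only logarithmically, which is the kind of difference in cluster counts and relative sizes that the abstract refers to.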

Author and article information

Publication date Created: 2014-07-21

Publication date Updated: 2016-04-13

Article

arXiv ID: 1407.5472

SO-VID: b0246b7b-7213-4dad-95ca-51709a71cead

License:

http://arxiv.org/licenses/nonexclusive-distrib/1.0/

Custom metadata

Comments: This version has been substantially revised; see the new version of the article at arXiv:1604.03615.

Categories: stat.ME

ScienceOpen disciplines: Methodology

Related collections

Genomic Prediction