Feature selection in omics prediction problems using cat scores and false nondiscovery rate control

Abstract

We revisit the problem of feature selection in linear discriminant analysis (LDA), that is, when features are correlated. First, we introduce a pooled centroids formulation of the multiclass LDA predictor function, in which the relative weights of Mahalanobis-transformed predictors are given by correlation-adjusted \(t\)-scores (cat scores). Second, for feature selection we propose thresholding cat scores by controlling false nondiscovery rates (FNDR). Third, training of the classifier is based on James–Stein shrinkage estimates of correlations and variances, where regularization parameters are chosen analytically without resampling. Overall, this results in an effective and computationally inexpensive framework for high-dimensional prediction with natural feature selection. The proposed shrinkage discriminant procedures are implemented in the R package "sda" available from the R repository CRAN.
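To make the first step of the abstract concrete, here is a minimal sketch of how cat scores could be computed against a pooled centroid. It rests on assumptions not spelled out in the abstract: the cat score for class k is taken to be the inverse square root of the pooled within-class correlation matrix applied to the vector of per-feature t-scores of the class centroid against the pooled centroid. The function name `cat_scores` is hypothetical, the code is Python rather than R, and it uses plain empirical estimates rather than the James–Stein shrinkage estimates the abstract describes, so it only applies when there are more samples than features. It is not the implementation in the "sda" package.

```python
import numpy as np


def cat_scores(X, y):
    """Unshrunken correlation-adjusted t-scores (illustrative sketch only).

    X: (n, p) data matrix, y: (n,) class labels.
    Returns the sorted class labels and a (p, K) matrix of scores.
    Assumes n - K >= p so the empirical correlation matrix is invertible.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, p = X.shape
    classes = np.unique(y)
    mu_pool = X.mean(axis=0)  # pooled centroid over all samples

    # Pooled within-class covariance with n - K degrees of freedom.
    resid = X.copy()
    for k in classes:
        resid[y == k] -= X[y == k].mean(axis=0)
    S = resid.T @ resid / (n - len(classes))
    sd = np.sqrt(np.diag(S))
    R = S / np.outer(sd, sd)  # pooled within-class correlation matrix

    # Inverse matrix square root R^(-1/2) via eigendecomposition.
    w, U = np.linalg.eigh(R)
    R_inv_sqrt = (U / np.sqrt(w)) @ U.T

    scores = np.empty((p, len(classes)))
    for j, k in enumerate(classes):
        n_k = np.sum(y == k)
        # t-score of the class centroid against the pooled centroid.
        t_k = (X[y == k].mean(axis=0) - mu_pool) / (
            sd * np.sqrt(1.0 / n_k - 1.0 / n)
        )
        scores[:, j] = R_inv_sqrt @ t_k  # decorrelate the t-scores
    return classes, scores
```

When the features are uncorrelated, the correlation matrix is the identity and the cat scores reduce to ordinary t-scores. One plausible way to rank features for thresholding (again an assumption here, not a statement from the abstract) is by the sum of squared cat scores across classes, `(scores ** 2).sum(axis=1)`.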

Author and article information

Publication date Created: 2009-03-11

Publication date Updated: 2010-10-08

Article

DOI: 10.1214/09-AOAS277

arXiv ID: 0903.2003

License:

http://arxiv.org/licenses/nonexclusive-distrib/1.0/

Custom metadata

Report No IMS-AOAS-AOAS277

Journal reference Annals of Applied Statistics 2010, Vol. 4, No. 1, 503-519

Comments Published at http://dx.doi.org/10.1214/09-AOAS277 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Categories stat.AP stat.ML

ScienceOpen disciplines: Applications, Machine learning
