Abstract

Genome-wide association studies (GWASs) have identified hundreds of single nucleotide polymorphisms (SNPs) associated with complex human diseases. However, risk prediction models based on them have limited discriminatory accuracy. It has been suggested that including many such SNPs can improve predictive performance. Here, we studied various aspects of model building to improve discriminatory accuracy, as measured by the area under the receiver operating characteristic curve (AUC), including: (1) How well does a one-phase procedure that selects SNPs and estimates odds ratios on the same data perform? (2) How should training data be allocated between SNP selection (Phase 1) and estimation (Phase 2) in a two-phase procedure? (3) Should SNP selection be based on P-value thresholding or ranking P-values? (4) How many SNPs should be selected? and (5) Is multivariate estimation preferred to univariate estimation in the presence of linkage disequilibrium (LD)? We used realistic estimates of the distributions of genetic effect sizes, allele frequencies, and LD patterns based on GWAS data for Crohn's disease and prostate cancer. Theory and simulations were used to estimate AUC. Empirical risk models based on 10,000 cases and controls had considerably lower AUC than theoretically achievable. The most critical aspect of prediction model building was initial SNP selection. The single-phase procedure achieved higher AUC than the two-phase procedure. Multivariate estimation did not perform as well as univariate (marginal) estimation. For complex diseases and samples of 10,000 or fewer cases and controls, one should limit the number of SNPs to tens or hundreds. © 2013 WILEY PERIODICALS, INC.
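The one-phase procedure mentioned in the abstract (select SNPs by ranking P-values, estimate marginal odds ratios on the same training data, then measure AUC on independent data) can be illustrated with a minimal sketch. This is not the authors' simulation. It assumes independent SNPs (no LD), a single allele frequency, and made-up effect sizes and sample sizes, and all function names are hypothetical.

```python
import math
import random


def simulate(n, betas, freq, rng):
    """Draw n individuals with independent SNPs (no LD) under a logistic disease model."""
    genotypes, status = [], []
    mean_score = sum(2 * freq * b for b in betas)  # centre risk so roughly half are cases
    for _ in range(n):
        # Genotype = number of minor alleles (0, 1 or 2) at each SNP
        g = [(rng.random() < freq) + (rng.random() < freq) for _ in betas]
        eta = sum(b * x for b, x in zip(betas, g)) - mean_score
        status.append(1 if rng.random() < 1 / (1 + math.exp(-eta)) else 0)
        genotypes.append(g)
    return genotypes, status


def marginal_log_or(genotypes, status, j):
    """Allelic log odds ratio for SNP j and its two-sided Wald P-value."""
    # 0.5 is added to every cell to avoid division by zero
    case_minor = case_major = ctrl_minor = ctrl_major = 0.5
    for g, y in zip(genotypes, status):
        if y:
            case_minor += g[j]
            case_major += 2 - g[j]
        else:
            ctrl_minor += g[j]
            ctrl_major += 2 - g[j]
    log_or = math.log(case_minor * ctrl_major / (case_major * ctrl_minor))
    se = math.sqrt(1 / case_minor + 1 / case_major + 1 / ctrl_minor + 1 / ctrl_major)
    p_value = math.erfc(abs(log_or) / se / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return log_or, p_value


def one_phase_model(genotypes, status, n_select):
    """Rank SNPs by P-value and keep the marginal log ORs of the top n_select,
    with selection and estimation done on the same training data."""
    n_snps = len(genotypes[0])
    stats = [(marginal_log_or(genotypes, status, j), j) for j in range(n_snps)]
    stats.sort(key=lambda t: t[0][1])
    return {j: log_or for (log_or, _p), j in stats[:n_select]}


def risk_scores(model, genotypes):
    """Additive score: sum of estimated log OR times minor-allele count."""
    return [sum(w * g[j] for j, w in model.items()) for g in genotypes]


def auc(scores, status):
    """AUC as the probability that a case outscores a control (ties count one half)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # 1-based midrank for tied scores
        i = j + 1
    n_case = sum(status)
    n_ctrl = len(status) - n_case
    rank_sum = sum(r for r, y in zip(ranks, status) if y)
    return (rank_sum - n_case * (n_case + 1) / 2) / (n_case * n_ctrl)


rng = random.Random(0)
betas = [0.4] * 5 + [0.0] * 95  # hypothetical: 5 causal SNPs among 100
train_g, train_y = simulate(1500, betas, 0.3, rng)
test_g, test_y = simulate(800, betas, 0.3, rng)

model = one_phase_model(train_g, train_y, n_select=10)
test_auc = auc(risk_scores(model, test_g), test_y)
print(f"selected SNPs: {sorted(model)}; test AUC: {test_auc:.3f}")
```

Changing `n_select` in this toy setting shows how admitting more null SNPs adds noise to the score. A two-phase variant would call `one_phase_model` on one part of the training data for selection only and re-estimate the log odds ratios on the remainder.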

Author and article information

PubMed ID: 24166696

DOI: 10.1002/gepi.21762

ScienceOpen disciplines: Chemistry

Keywords: Area Under Curve; Crohn Disease; genetics; Gene Frequency; Genetic Predisposition to Disease; Genome-Wide Association Study; Genotype; Humans; Linkage Disequilibrium; Male; Models, Genetic; Odds Ratio; Polymorphism, Single Nucleotide; Prostatic Neoplasms; ROC Curve; Rare Diseases; Reproducibility of Results

Strategies for developing prediction models from genome-wide association studies.
