      Strategies for developing prediction models from genome-wide association studies.

      Genetic Epidemiology

      Area Under Curve; Crohn Disease/genetics; Gene Frequency; Genetic Predisposition to Disease; Genome-Wide Association Study; Genotype; Humans; Linkage Disequilibrium; Male; Models, Genetic; Odds Ratio; Polymorphism, Single Nucleotide; Prostatic Neoplasms; ROC Curve; Rare Diseases; Reproducibility of Results

          Genome-wide association studies (GWASs) have identified hundreds of single nucleotide polymorphisms (SNPs) associated with complex human diseases. However, risk prediction models based on them have limited discriminatory accuracy. It has been suggested that including many such SNPs can improve predictive performance. Here, we studied various aspects of model building to improve discriminatory accuracy, as measured by the area under the receiver operating characteristic curve (AUC), including: (1) How well does a one-phase procedure that selects SNPs and estimates odds ratios on the same data perform? (2) How should training data be allocated between SNP selection (Phase 1) and estimation (Phase 2) in a two-phase procedure? (3) Should SNP selection be based on P-value thresholding or ranking P-values? (4) How many SNPs should be selected? and (5) Is multivariate estimation preferred to univariate estimation in the presence of linkage disequilibrium (LD)? We used realistic estimates of the distributions of genetic effect sizes, allele frequencies, and LD patterns based on GWAS data for Crohn's disease and prostate cancer. Theory and simulations were used to estimate AUC. Empirical risk models based on 10,000 cases and controls had considerably lower AUC than theoretically achievable. The most critical aspect of prediction model building was initial SNP selection. The single-phase procedure achieved higher AUC than the two-phase procedure. Multivariate estimation did not perform as well as univariate (marginal) estimation. For complex diseases and samples of 10,000 or fewer cases and controls, one should limit the number of SNPs to tens or hundreds. © 2013 WILEY PERIODICALS, INC.
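
To make the one-phase procedure described in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It selects SNPs by ranking marginal test statistics and estimates univariate log odds ratios on the same training data, then measures AUC on an independent test set. Everything in it is an illustrative assumption: the function names (`simulate`, `marginal_log_or`, `auc`, `one_phase_auc`), the toy simulation with independent SNPs (no LD), a common effect size for all causal SNPs, and the sample sizes. The abstract's own analyses use effect-size, allele-frequency, and LD distributions estimated from GWAS data, which this sketch does not attempt to reproduce.

```python
import numpy as np

def simulate(n, n_snps, n_causal, beta, rng, maf=None):
    """Toy data (assumption): independent SNPs, additive effects on the log-odds."""
    if maf is None:
        maf = rng.uniform(0.05, 0.5, n_snps)          # minor allele frequencies
    geno = rng.binomial(2, maf, size=(n, n_snps)).astype(float)
    effects = np.zeros(n_snps)
    effects[:n_causal] = beta                          # first n_causal SNPs are causal
    logit = (geno - 2 * maf) @ effects                 # centered, so about half are cases
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))
    return geno, y, maf

def marginal_log_or(geno, y):
    """Univariate (marginal) allelic log odds ratio and z statistic per SNP."""
    cases, controls = geno[y == 1], geno[y == 0]
    a = cases.sum(axis=0) + 0.5                        # minor alleles in cases
    b = 2 * len(cases) - cases.sum(axis=0) + 0.5       # major alleles in cases
    c = controls.sum(axis=0) + 0.5                     # minor alleles in controls
    d = 2 * len(controls) - controls.sum(axis=0) + 0.5
    log_or = np.log(a * d / (b * c))
    se = np.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_or, log_or / se

def auc(score, y):
    """AUC from the Mann-Whitney rank-sum statistic (ties broken arbitrarily)."""
    order = np.argsort(score)
    ranks = np.empty(len(score))
    ranks[order] = np.arange(1, len(score) + 1)
    n1 = int(np.sum(y == 1))
    n0 = len(y) - n1
    return (ranks[y == 1].sum() - n1 * (n1 + 1) / 2) / (n1 * n0)

def one_phase_auc(k, train, test):
    """Select the top-k SNPs by |z| and weight them by log OR from the SAME data."""
    g_tr, y_tr = train
    g_te, y_te = test
    log_or, z = marginal_log_or(g_tr, y_tr)
    top = np.argsort(-np.abs(z))[:k]                   # ranking-based selection
    return auc(g_te[:, top] @ log_or[top], y_te)

rng = np.random.default_rng(0)
g_tr, y_tr, maf = simulate(2000, 1000, 20, 0.3, rng)
g_te, y_te, _ = simulate(2000, 1000, 20, 0.3, rng, maf=maf)  # same SNPs, new subjects

results = {k: one_phase_auc(k, (g_tr, y_tr), (g_te, y_te)) for k in (5, 20, 100, 1000)}
for k, value in results.items():
    print(f"top {k:4d} SNPs: test AUC = {value:.3f}")
```

Varying `k` in this toy setting gives a rough feel for the trade-off the abstract points to: including more SNPs adds some true signal but also many noise SNPs with overestimated effects, which is one reason the number of selected SNPs matters for a fixed training sample. A two-phase variant would split the training subjects, computing `z` on one part and `log_or` on the other.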
