      Efficient Implementation of Penalized Regression for Genetic Risk Prediction

          Abstract

          Polygenic risk scores (PRS) combine many single-nucleotide polymorphisms into a score reflecting the genetic risk of developing a disease. Privé, Aschard, and Blum present an efficient implementation of penalized logistic regression...

          Abstract

          Polygenic risk scores (PRS) combine genotype information across many single-nucleotide polymorphisms (SNPs) to give a score reflecting the genetic risk of developing a disease. PRS might have a major impact on public health, possibly allowing for screening campaigns to identify individuals at high genetic risk for a given disease. The “Clumping+Thresholding” (C+T) approach is the most common method to derive PRS. C+T uses only univariate genome-wide association studies (GWAS) summary statistics, which makes it fast and easy to use. However, previous work showed that jointly estimating SNP effects for computing PRS has the potential to significantly improve the predictive performance of PRS as compared to C+T. In this paper, we present an efficient method for the joint estimation of SNP effects using individual-level data, allowing for practical application of penalized logistic regression (PLR) on modern datasets including hundreds of thousands of individuals. Moreover, our implementation of PLR directly includes automatic choices for hyper-parameters. We also provide an implementation of penalized linear regression for quantitative traits. We compare the performance of PLR, C+T and a derivation of random forests using both real and simulated data. Overall, we find that PLR achieves equal or higher predictive performance than C+T in most scenarios considered, while being scalable to biobank data. In particular, we find that improvement in predictive performance is more pronounced when there are few effects located in nearby genomic regions with correlated SNPs; for instance, in simulations, AUC values increase from 83% with the best prediction of C+T to 92.5% with PLR. We confirm these results in a data analysis of a case-control study for celiac disease where PLR and the standard C+T method achieve AUC values of 89% and 82.5%, respectively.
          Applying penalized linear regression to 350,000 individuals of the UK Biobank, we predict height with a larger correlation than with the best prediction of C+T (∼65% instead of ∼55%), further demonstrating its scalability and strong predictive power, even for highly polygenic traits. Moreover, using 150,000 individuals of the UK Biobank, we are able to predict breast cancer better than C+T, fitting PLR in only a few minutes. In conclusion, this paper demonstrates the feasibility and relevance of using penalized regression for PRS computation when large individual-level datasets are available, thanks to the efficient implementation available in our R package bigstatsr.
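
          The ideas in the abstract can be illustrated with a small, self-contained sketch. It is not the bigstatsr implementation and does not reproduce any result from the paper. It assumes a plain lasso (L1) penalty fitted by proximal gradient descent, a penalty strength fixed by hand rather than chosen automatically, and toy simulated genotypes with independent SNPs. All function names (`prs`, `fit_lasso_logistic`, `auc`, `simulate`) are hypothetical and chosen for illustration only.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def soft_threshold(x, t):
    # Proximal operator of t * |x|: shrinks x toward zero by t.
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def prs(genotypes, betas, intercept=0.0):
    # Polygenic score of one individual: weighted sum of allele counts.
    return intercept + sum(g * b for g, b in zip(genotypes, betas))


def fit_lasso_logistic(X, y, lam, step=0.1, n_iter=300):
    # L1-penalized logistic regression, all SNP effects estimated jointly.
    # Assumption: the fixed step size is small enough for this toy problem.
    n, p = len(X), len(X[0])
    betas, intercept = [0.0] * p, 0.0
    for _ in range(n_iter):
        # Residuals (predicted probability minus label) at the current fit.
        resid = [sigmoid(prs(row, betas, intercept)) - yi for row, yi in zip(X, y)]
        intercept -= step * sum(resid) / n  # the intercept is not penalized
        for j in range(p):
            grad_j = sum(r * row[j] for r, row in zip(resid, X)) / n
            betas[j] = soft_threshold(betas[j] - step * grad_j, step * lam)
    return betas, intercept


def auc(scores, labels):
    # Probability that a random case outranks a random control (ties count 1/2).
    cases = [s for s, l in zip(scores, labels) if l == 1]
    controls = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((c > d) + 0.5 * (c == d) for c in cases for d in controls)
    return wins / (len(cases) * len(controls))


def simulate(n, p, causal, rng):
    # Toy data: independent SNPs, allele counts in {0, 1, 2}, allele frequency 0.3.
    X = [[(rng.random() < 0.3) + (rng.random() < 0.3) for _ in range(p)]
         for _ in range(n)]
    y = [int(rng.random() < sigmoid(prs(row, causal, -1.0))) for row in X]
    return X, y


rng = random.Random(0)
true_betas = [1.0, -1.0, 0.8] + [0.0] * 17  # 3 causal SNPs out of 20
X_train, y_train = simulate(400, 20, true_betas, rng)
X_test, y_test = simulate(200, 20, true_betas, rng)

betas, intercept = fit_lasso_logistic(X_train, y_train, lam=0.02)
test_scores = [prs(row, betas, intercept) for row in X_test]
print("nonzero effects:", sum(b != 0.0 for b in betas))
print("test AUC:", round(auc(test_scores, y_test), 3))
```

          The score is evaluated on individuals not used for fitting, which is how AUC values such as those quoted in the abstract are meant to be read. A real analysis would also need to handle correlated SNPs, covariates, and the choice of penalty, none of which this sketch attempts.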


                Author and article information

                Journal: Genetics
                Publisher: Genetics Society of America
                ISSN: 0016-6731 (print), 1943-2631 (online)
                Issue date: May 2019 (published online 26 February 2019)
                Volume 212, Issue 1, Pages 65-74
                Affiliations
                [*] Laboratoire TIMC-IMAG, UMR 5525, University of Grenoble Alpes, CNRS, 38700 La Tronche, France
                Centre de Bioinformatique, Biostatistique et Biologie Intégrative (C3BI), Institut Pasteur, 75015 Paris, France
                Author notes
                [1] Corresponding authors: Laboratoire TIMC-IMAG, UMR 5525, Université Grenoble Alpes, CNRS, 5 Ave. du Grand Sablon, 38700 La Tronche, France. E-mail: florian.prive@univ-grenoble-alpes.fr; and michael.blum@univ-grenoble-alpes.fr
                Article: 302019
                DOI: 10.1534/genetics.119.302019
                PMCID: PMC6499521
                PMID: 30808621
                Copyright © 2019 Privé et al.

                Available freely online through the author-supported open access option.

                This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

                History
                Received: 11 October 2018
                Accepted: 22 February 2019
                Page count
                Figures: 3, Tables: 3, Equations: 3, References: 40, Pages: 10
                Categories
                Investigations
                Genomic Prediction

                Keywords
                polygenic risk scores, SNP, lasso, genomic prediction, GenPred, shared data resources
