      Is Open Access

      Transfer learning with false negative control improves polygenic risk prediction

      research-article


          Abstract

          A polygenic risk score (PRS) aggregates the effects of variants across the genome to estimate an individual’s genetic predisposition for a given trait. PRS analysis typically takes two input data sets: base data for effect size estimation and target data for individual-level prediction. With large-scale base data increasingly available, it is becoming more common for the ancestral backgrounds of the base and target data to differ. In this paper, we treat the GWAS summary information obtained from the base data as knowledge learned from a pre-trained model, and adopt a transfer learning framework that leverages this knowledge, whether or not the base data share the ancestral background of the target samples, to build prediction models for target individuals. Our proposed transfer learning framework consists of two main steps: (1) conducting false negative control (FNC) marginal screening to extract useful knowledge from the base data; and (2) performing joint model training to integrate the knowledge extracted from the base data with the target training data for accurate trans-data prediction. This new approach can significantly enhance the computational and statistical efficiency of joint-model training, alleviate over-fitting, and facilitate more accurate trans-data prediction whether the level of heterogeneity between the target and base data sets is low or high.
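          To make the opening definition concrete, the sketch below computes a PRS in its simplest additive form: a sum of allele dosages weighted by per-variant effect sizes taken from base data. The function name, the toy effect sizes, and the toy genotypes are all illustrative assumptions; the abstract does not specify a particular scoring formula or software.

```python
def polygenic_risk_score(dosages, effect_sizes):
    """Additive PRS: sum over variants of (allele dosage * effect size).

    dosages      -- allele counts (0, 1, or 2) for one individual
    effect_sizes -- per-variant effect estimates from base (GWAS) data
    """
    if len(dosages) != len(effect_sizes):
        raise ValueError("dosages and effect_sizes must have the same length")
    return sum(g * b for g, b in zip(dosages, effect_sizes))


# Hypothetical base-data effect sizes for three variants.
base_effects = [0.10, -0.05, 0.20]

# Hypothetical target genotypes: one row per individual.
target_genotypes = [
    [0, 1, 2],
    [2, 0, 1],
]

scores = [polygenic_risk_score(g, base_effects) for g in target_genotypes]
print(scores)
```

          In this form, any mismatch between base and target ancestry enters only through the effect sizes, which is why the paper re-trains the model on target data rather than reusing base estimates directly.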

          Author summary

          A polygenic risk score (PRS) quantifies the genetic predisposition for a trait. PRS construction typically takes two input datasets: base data for variant-effect estimation and target data for individual-level prediction. With large-scale base data increasingly available, it has become common for the ancestral backgrounds of the base and target data to differ. In this paper, we introduce a PRS method under a transfer learning framework that leverages the knowledge learned from the base data, whether or not those data share the background of the target samples, to build prediction models for target individuals. Our method first uses a false-negative control strategy to extract useful information from the base data while retaining a high proportion of true signals; it then applies the extracted information to re-train PRS models in a statistically and computationally efficient fashion. Using numerical studies based on simulated and real data, we show that the proposed method can increase the accuracy and robustness of polygenic prediction across a range of sample sizes and levels of heterogeneity between base and target data, reduce the computational cost of model re-training, and yield more parsimonious models that can facilitate PRS interpretation and/or exploration of complex, non-additive PRS models.
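          The two-step idea described above (screen variants using base data, then re-train on target data) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the screening step here is a plain p-value cutoff standing in for the false-negative control procedure, whose details are not given in this summary, and the re-training step uses a lasso fitted by coordinate descent as one possible joint model. All function names, the threshold, the penalty, and the toy data are hypothetical.

```python
def screen_by_pvalue(base_pvalues, threshold):
    """Step 1 (stand-in): keep variants whose base-data p-value is small.
    NOTE: a plain cutoff, NOT the paper's false-negative control rule."""
    return [j for j, p in enumerate(base_pvalues) if p <= threshold]


def soft_threshold(z, lam):
    """Soft-thresholding operator used by the lasso update."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Minimise (1/2n)||y - Xb||^2 + lam * ||b||_1 by cyclic coordinate
    descent. Assumes columns of X and y are already centred (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j])
                      for i in range(n)) / n
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def fit_screened_lasso(X, y, base_pvalues, threshold, lam):
    """Step 2: re-train on target data using only the screened variants."""
    keep = screen_by_pvalue(base_pvalues, threshold)
    beta = [0.0] * len(X[0])
    if keep:
        X_sub = [[row[j] for j in keep] for row in X]
        for j, b in zip(keep, lasso_cd(X_sub, y, lam)):
            beta[j] = b
    return keep, beta


# Toy centred target data: 6 individuals, 4 variants; trait driven by variant 0.
X = [
    [-1,  1,  0,  1],
    [-1, -1,  0,  0],
    [ 0,  1,  1, -1],
    [ 0, -1, -1,  1],
    [ 1,  0, -1,  0],
    [ 1,  0,  1, -1],
]
y = [2.0 * row[0] for row in X]
base_pvalues = [1e-8, 0.5, 0.02, 0.9]  # hypothetical base GWAS p-values

keep, beta = fit_screened_lasso(X, y, base_pvalues, threshold=0.05, lam=0.1)
print(keep, beta)
```

          Screening before fitting shrinks the number of columns the penalised regression has to visit, which is the source of the computational saving, and the final coefficient vector is sparse, which is what makes the resulting model more parsimonious.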


                Author and article information

                Contributors
                Roles: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Writing – original draft, Writing – review & editing
                Roles: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Writing – original draft, Writing – review & editing
                Roles: Formal analysis, Investigation, Writing – review & editing
                Roles: Formal analysis, Investigation, Writing – review & editing
                Roles: Conceptualization, Data curation, Funding acquisition, Investigation, Methodology, Project administration, Supervision, Writing – original draft, Writing – review & editing
                Role: Editor
                Journal
                PLOS Genetics (PLoS Genet)
                Public Library of Science (San Francisco, CA, USA)
                ISSN: 1553-7390, 1553-7404
                Published: 27 November 2023
                Volume 19, Issue 11, e1010597
                Affiliations
                [1] Department of Statistics, North Carolina State University, Raleigh, North Carolina, United States of America
                [2] Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, United States of America
                [3] Institute of Health Data Analytics and Statistics, National Taiwan University, Taipei, Taiwan
                [4] Department of Public Health, National Taiwan University, Taipei, Taiwan
                Emory University, UNITED STATES
                Author notes

                The authors have declared that no competing interests exist.

                [¤]

                Current address: Department of Statistics and Bioinformatics Research Center, Campus Box 7566, Raleigh NC, United States of America

                Author information
                https://orcid.org/0009-0004-8206-9403
                https://orcid.org/0000-0002-5505-1775
                Article
                Manuscript number: PGENETICS-D-23-00002
                DOI: 10.1371/journal.pgen.1010597
                PMCID: PMC10723713
                PMID: 38011285
                26eefbf8-b83a-4ac5-a3d6-66d1757cba7e
                © 2023 Jeng et al

                This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                History
                Received: 1 January 2023
                Accepted: 9 November 2023
                Page count
                Figures: 3, Tables: 3, Pages: 17
                Funding
                Funded by: Foundation for the National Institutes of Health (funder-id http://dx.doi.org/10.13039/100000009)
                Award IDs: RF1AG074328, U24AG041689
                This work is partially supported by National Institutes of Health (https://www.nih.gov/) grants RF1AG074328 and U24AG041689 to JYT. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
                Categories
                Research Article
                Biology and Life Sciences > Computational Biology > Genome Analysis > Genome-Wide Association Studies
                Biology and Life Sciences > Genetics > Genomics > Genome Analysis > Genome-Wide Association Studies
                Biology and Life Sciences > Genetics > Human Genetics > Genome-Wide Association Studies
                Research and Analysis Methods > Simulation and Modeling
                Research and Analysis Methods > Mathematical and Statistical Techniques > Statistical Methods > Forecasting
                Physical Sciences > Mathematics > Statistics > Statistical Methods > Forecasting
                Biology and Life Sciences > Genetics > Single Nucleotide Polymorphisms
                Biology and Life Sciences > Biochemistry > Lipids
                People and Places > Geographical Locations > Europe
                Physical Sciences > Mathematics > Statistics > Statistical Data
                Medicine and Health Sciences > Epidemiology > Medical Risk Factors
                Custom metadata
                vor-update-to-uncorrected-proof
                2023-12-15
                • Simulation data: the R code used in the simulations is available at https://github.com/JessieJeng/FNC-Lasso.
                • CoLaus/PsyCoLaus data: access can be requested through the database of Genotypes and Phenotypes (dbGaP; study accession phs000145.v4.p2) on the NCBI website: https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000145.v4.p2.
                • The numerical results underlying the tables and figures are included as supporting material (S2 Table).
