      Is Open Access

      Efficient toolkit implementing best practices for principal component analysis of population genetic data

      research-article


          ABSTRACT

          Motivation

          Principal component analysis (PCA) of genetic data is routinely used to infer ancestry and to control for population structure in various genetic analyses. However, conducting PCA can be complicated and has several potential pitfalls. These include (i) capturing linkage disequilibrium (LD) structure instead of population structure, (ii) projected PCs that suffer from shrinkage bias, (iii) detecting sample outliers and (iv) uneven population sizes. In this work, we explore these potential issues and present efficient solutions to them. Following applications to the UK Biobank and the 1000 Genomes project datasets, we make recommendations for best practices and provide efficient and user-friendly implementations of the proposed solutions in the R packages bigsnpr and bigutilsr.
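Pitfall (i) can be illustrated with a small toy simulation. The sketch below is not taken from the paper or from bigsnpr: it uses Gaussian columns rather than genotypes, and the function name, sample sizes and block size are illustrative assumptions. It simulates samples with no population structure at all, adds one block of mutually correlated columns as a stand-in for a long-range LD region, and measures how much of PC1's loading weight falls inside that block.

```python
import numpy as np


def ld_block_loading_share(n=200, p=2000, block=100, seed=0):
    """Share of PC1's squared loadings that falls inside a correlated block.

    Toy setup (an assumption, not the paper's data): n samples, p Gaussian
    "variants", and no population structure of any kind.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))

    # Hypothetical long-range LD region: the first `block` columns share
    # one latent factor, so they are correlated with each other.
    latent = rng.standard_normal(n)
    X[:, :block] += latent[:, None]

    # Center and scale each column, as is common before PCA.
    X = (X - X.mean(axis=0)) / X.std(axis=0)

    # Rows of Vt are the PC loading vectors, each with unit norm.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    pc1 = Vt[0]

    # Because pc1 has unit norm, this sum is a share between 0 and 1.
    return float(np.sum(pc1[:block] ** 2))


if __name__ == "__main__":
    print(f"Share of PC1 loadings inside the LD block: {ld_block_loading_share():.2f}")
```

In this setup the block covers only 5% of the columns, yet most of PC1's loading weight lands on it: the leading PC describes correlation between neighbouring variables rather than differences between groups of samples.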

          Results

          For example, we find that PC19–PC40 in the UK Biobank capture complex LD structure rather than population structure. Using our automatic algorithm for removing long-range LD regions, we recover 16 PCs that capture population structure only. Therefore, we recommend using only 16–18 PCs from the UK Biobank to account for population structure confounding. We also show how to use PCA to restrict analyses to individuals of homogeneous ancestry. Finally, when projecting individual genotypes onto the PCA computed from the 1000 Genomes project data, we find a shrinkage bias that becomes large for PC5 and beyond. We then demonstrate how to obtain unbiased projections efficiently using bigsnpr. Overall, we believe this work will be of interest to anyone using PCA in their analyses of genetic data, as well as of other omics data.
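The shrinkage bias of projected PCs can be reproduced in a toy setting. The sketch below is only an illustration under stated assumptions: it uses Gaussian data with two equally sized groups instead of genotypes, the function name and parameter values are hypothetical, and it does not implement the correction provided in bigsnpr. It computes PCA on a training sample with many more variables than individuals, projects a second sample drawn from the same model, and compares the group separation on PC1 in both sets of scores.

```python
import numpy as np


def projection_shrinkage_ratio(n=100, p=5000, delta=0.1, seed=0):
    """Ratio of PC1 group separation: projected new samples / training samples.

    A ratio below 1 means the projected scores are shrunk toward zero.
    n must be even; all values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    groups = np.tile([-1.0, 1.0], n // 2)  # two balanced groups
    shift = np.full(p, delta)              # small mean shift on every variable

    def simulate():
        # Unit-variance noise plus a group-specific mean shift.
        return rng.standard_normal((n, p)) + np.outer(groups, shift)

    train, new = simulate(), simulate()

    # PCA of the training sample via SVD of the centered matrix.
    center = train.mean(axis=0)
    U, s, Vt = np.linalg.svd(train - center, full_matrices=False)
    train_scores = U[:, 0] * s[0]

    # Naive projection: center with the training means, multiply by loadings.
    new_scores = (new - center) @ Vt[0]

    # Group separation on PC1; abs() removes the arbitrary sign of the PC.
    train_sep = abs(np.mean(groups * train_scores))
    new_sep = abs(np.mean(groups * new_scores))
    return float(new_sep / train_sep)


if __name__ == "__main__":
    print(f"Projected / training separation on PC1: {projection_shrinkage_ratio():.2f}")
```

Both samples come from the same model, so unbiased scores would give a ratio close to 1. With far more variables than individuals, the training scores also absorb sample-specific noise, so the naively projected individuals sit closer to zero and the ratio falls well below 1.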

          Availability and implementation

          R packages bigsnpr and bigutilsr can be installed from either CRAN or GitHub (see https://github.com/privefl/bigsnpr). A tutorial on the steps to perform PCA on 1000G data is available at https://privefl.github.io/bigsnpr/articles/bedpca.html. All code used for this paper is available at https://github.com/privefl/paper4-bedpca/tree/master/code.

          Supplementary information

          Supplementary data are available at Bioinformatics online.


                Author and article information

                Contributors
                Role: Associate Editor
                Journal
                Bioinformatics
                Oxford University Press
                1367-4803
                1367-4811
                15 August 2020
                16 May 2020
                16 May 2020
                Volume: 36
                Issue: 16
                Pages: 4449-4457
                Affiliations
                [1 ] National Centre for Register-Based Research, Aarhus University , Aarhus 8210, Denmark
                [2 ] Laboratoire TIMC-IMAG, UMR 5525, Univ. Grenoble Alpes , La Tronche 38700, France
                [3 ] OWKIN France , Paris 75010, France
                [4 ] Queensland Brain Institute, University of Queensland , St. Lucia, 4072 Queensland, Australia
                [5 ] Queensland Centre for Mental Health Research, The Park Centre for Mental Health , Wacol, 4076 Queensland, Australia
                Author notes
                To whom correspondence should be addressed. Email: florian.prive.21@gmail.com or bjv@econ.au.dk
                Article
                Article ID: btaa520
                DOI: 10.1093/bioinformatics/btaa520
                PMCID: PMC7750941
                PMID: 32415959
                5e0064bd-d2fb-4813-b990-ca3edabb9b23
                © The Author(s) 2020. Published by Oxford University Press.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

                History
                : 19 February 2020
                : 07 May 2020
                : 12 May 2020
                Page count
                Pages: 9
                Funding
                Funded by: Danish National Research Foundation, DOI 10.13039/501100001732;
                Funded by: Lundbeck Foundation Initiative for Integrative Psychiatric Research, iPSYCH;
                Award ID: R248-2017-2003
                Categories
                Original Papers
                Genetics and Population Analysis
                AcademicSubjects/SCI01060

                Bioinformatics & Computational biology