Efficient toolkit implementing best practices for principal component analysis of population genetic data

ABSTRACT

Motivation

Principal component analysis (PCA) of genetic data is routinely used to infer ancestry and control for population structure in various genetic analyses. However, conducting PCA can be complicated and has several potential pitfalls: (i) capturing linkage disequilibrium (LD) structure instead of population structure, (ii) shrinkage bias in projected PCs, (iii) failing to detect sample outliers and (iv) uneven population sizes. In this work, we explore these issues and present efficient solutions to them. Following applications to the UK Biobank and the 1000 Genomes project datasets, we make recommendations for best practices and provide efficient and user-friendly implementations of the proposed solutions in the R packages bigsnpr and bigutilsr.

Results

For example, we find that PC19–PC40 in the UK Biobank capture complex LD structure rather than population structure. Using our automatic algorithm for removing long-range LD regions, we recover 16 PCs that capture population structure only. Therefore, we recommend using only 16–18 PCs from the UK Biobank to account for population structure confounding. We also show how to use PCA to restrict analyses to individuals of homogeneous ancestry. Finally, when projecting individual genotypes onto the PCA computed from the 1000 Genomes project data, we find a shrinkage bias that becomes large for PC5 and beyond. We then demonstrate how to obtain unbiased projections efficiently using bigsnpr. Overall, we believe this work will be of interest to anyone using PCA in their analyses of genetic data, as well as of other omics data.
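
The shrinkage bias mentioned above can be illustrated with a small simulation. The sketch below is not the method implemented in bigsnpr, nor the analysis from the paper. It is a toy example under assumed settings (two simulated groups, Gaussian noise, many more variables than samples, and hypothetical function names), written with the Python standard library only. It computes PC1 on a reference sample, projects new individuals drawn from the same distribution onto it, and compares the spread of the two sets of scores.

```python
import math
import random


def simulate(n_per_group, p, delta, rng):
    """Two groups with means -delta and +delta on every variable, unit Gaussian noise."""
    return [[sign * delta + rng.gauss(0.0, 1.0) for _ in range(p)]
            for sign in (-1.0, 1.0) for _ in range(n_per_group)]


def center(X, means):
    """Subtract the given column means from every row."""
    return [[x - m for x, m in zip(row, means)] for row in X]


def first_pc_loadings(X, rng, n_iter=200):
    """First PC loadings of centered X, via power iteration on the n x n Gram matrix."""
    n = len(X)
    gram = [[sum(a * b for a, b in zip(X[i], X[j])) for j in range(n)]
            for i in range(n)]
    u = [rng.gauss(0.0, 1.0) for _ in range(n)]  # random starting vector
    for _ in range(n_iter):
        u = [sum(g * x for g, x in zip(row, u)) for row in gram]
        norm = math.sqrt(sum(x * x for x in u))
        u = [x / norm for x in u]
    # Map the sample-space vector u back to variable space: v is proportional to X^T u.
    v = [sum(u[i] * X[i][j] for i in range(n)) for j in range(len(X[0]))]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def scores(X, v):
    """PC scores: dot product of each row with the loading vector."""
    return [sum(x * w for x, w in zip(row, v)) for row in X]


def rms(values):
    """Root mean square, used here as a simple measure of spread around zero."""
    return math.sqrt(sum(x * x for x in values) / len(values))


def shrinkage_ratio(seed=1, n_per_group=20, p=400, delta=0.2):
    """Spread of projected PC1 scores divided by spread of reference PC1 scores."""
    rng = random.Random(seed)
    reference = simulate(n_per_group, p, delta, rng)
    new = simulate(n_per_group, p, delta, rng)
    # Center both sets with the reference means, as a projection would.
    means = [sum(col) / len(reference) for col in zip(*reference)]
    reference_c, new_c = center(reference, means), center(new, means)
    v1 = first_pc_loadings(reference_c, rng)
    return rms(scores(new_c, v1)) / rms(scores(reference_c, v1))


if __name__ == "__main__":
    print(f"PC1 spread, projected / reference: {shrinkage_ratio():.2f}")
```

A ratio below 1 means the projected individuals sit closer to zero on PC1 than the reference individuals, even though both come from the same simulated populations. This happens because the reference scores partly fit noise when the number of variables greatly exceeds the number of samples, and it is the kind of bias that a naive projection of new genotypes can suffer from.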

Availability and implementation

R packages bigsnpr and bigutilsr can be installed from either CRAN or GitHub (see https://github.com/privefl/bigsnpr). A tutorial on the steps to perform PCA on 1000G data is available at https://privefl.github.io/bigsnpr/articles/bedpca.html. All code used for this paper is available at https://github.com/privefl/paper4-bedpca/tree/master/code.

Supplementary information

Supplementary data are available at Bioinformatics online.

Most cited references 38

Is Open Access

A global reference for human genetic variation

Lachlan Coin, Robert Garry, Oleksyk Taras (2017)

The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.

Is Open Access

The UK Biobank resource with deep phenotyping and genomic data

Clare Bycroft, Colin Freeman, Desislava Petkova … (2018)

The UK Biobank project is a prospective cohort study with deep genetic and phenotypic data collected on approximately 500,000 individuals from across the United Kingdom, aged between 40 and 69 at recruitment. The open resource is unique in its size and scope. A rich variety of phenotypic and health-related information is available on each participant, including biological measurements, lifestyle indicators, biomarkers in blood and urine, and imaging of the body and brain. Follow-up information is provided by linking health and medical records. Genome-wide genotype data have been collected on all participants, providing many opportunities for the discovery of new genetic associations and the genetic bases of complex traits. Here we describe the centralized analysis of the genetic data, including genotype quality, properties of population structure and relatedness of the genetic data, and efficient phasing and genotype imputation that increases the number of testable variants to around 96 million. Classical allelic variation at 11 human leukocyte antigen genes was imputed, resulting in the recovery of signals with known associations between human leukocyte antigen alleles and many diseases.

Principal components analysis corrects for stratification in genome-wide association studies.

Alkes L. Price, Nick Patterson, Robert Plenge … (2006)

Population stratification (allele frequency differences between cases and controls due to systematic ancestry differences) can cause spurious associations in disease studies. We describe a method that enables explicit detection and correction of population stratification on a genome-wide scale. Our method uses principal components analysis to explicitly model ancestry differences between cases and controls. The resulting correction is specific to a candidate marker's variation in frequency across ancestral populations, minimizing spurious associations while maximizing power to detect true associations. Our simple, efficient approach can easily be applied to disease studies with hundreds of thousands of markers.

Author and article information

Contributors

Russell Schwartz: Role: Associate Editor

Journal

Journal ID (nlm-ta): Bioinformatics

Journal ID (iso-abbrev): Bioinformatics

Journal ID (publisher-id): bioinformatics

Title: Bioinformatics

Publisher: Oxford University Press

ISSN (Print): 1367-4803

ISSN (Electronic): 1367-4811

Publication date (Collection): 15 August 2020

Publication date (Electronic): 16 May 2020

Publication date (PMC release): 16 May 2020

Volume: 36

Issue: 16

Pages: 4449-4457

Affiliations

[1] National Centre for Register-Based Research, Aarhus University, Aarhus 8210, Denmark

[2] Laboratoire TIMC-IMAG, UMR 5525, Univ. Grenoble Alpes, La Tronche 38700, France

[3] OWKIN France, Paris 75010, France

[4] Queensland Brain Institute, University of Queensland, St. Lucia, 4072 Queensland, Australia

[5] Queensland Centre for Mental Health Research, The Park Centre for Mental Health, Wacol, 4076 Queensland, Australia

Author notes

To whom correspondence should be addressed. Email: florian.prive.21@gmail.com or bjv@econ.au.dk

Article

Publisher ID: btaa520

DOI: 10.1093/bioinformatics/btaa520

PMC ID: 7750941

PubMed ID: 32415959

SO-VID: 5e0064bd-d2fb-4813-b990-ca3edabb9b23

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

History

Date received: 19 February 2020

Date revision received: 07 May 2020

Date accepted: 12 May 2020

Page count

Pages: 9

Funding

Funded by: Danish National Research Foundation, DOI 10.13039/501100001732;

Funded by: Lundbeck Foundation Initiative for Integrative Psychiatric Research, iPSYCH;

Award ID: R248-2017-2003
