Inference of Population Structure using Dense Haplotype Data

Abstract

The advent of genome-wide dense variation data provides an opportunity to investigate ancestry in unprecedented detail, but presents new statistical challenges. We propose a novel inference framework that aims to efficiently capture information on population structure provided by patterns of haplotype similarity. Each individual in a sample is considered in turn as a recipient, whose chromosomes are reconstructed using chunks of DNA donated by the other individuals. Results of this “chromosome painting” can be summarized as a “coancestry matrix,” which directly reveals key information about ancestral relationships among individuals. If markers are viewed as independent, we show that this matrix almost completely captures the information used by both standard Principal Components Analysis (PCA) and model-based approaches such as STRUCTURE in a unified manner. Furthermore, when markers are in linkage disequilibrium, the matrix combines information across successive markers to increase the ability to discern fine-scale population structure using PCA. In parallel, we have developed an efficient model-based approach to identify discrete populations using this matrix, which offers advantages over PCA in terms of interpretability and over existing clustering algorithms in terms of speed, number of separable populations, and sensitivity to subtle population structure. We analyse Human Genome Diversity Panel data for 938 individuals and 641,000 markers, and we identify 226 populations reflecting differences on continental, regional, local, and family scales. We present multiple lines of evidence that, while many methods capture similar information among strongly differentiated groups, more subtle population structure in human populations is consistently present at a much finer level than currently available geographic labels and is only captured by the haplotype-based approach. 
The software used for this article, ChromoPainter and fineSTRUCTURE, is available from http://www.paintmychromosomes.com/.
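
As a rough illustration of the independent-marker case described above, the following Python sketch builds a toy coancestry matrix. It is not the ChromoPainter algorithm: the function name `toy_coancestry`, the 0/1 haplotype encoding, and the rule that each marker is copied with equal probability from any other haplotype carrying the same allele are all assumptions made for this example.

```python
def toy_coancestry(haplotypes):
    """Expected number of markers each recipient copies from each donor.

    Toy model (an assumption for illustration, not ChromoPainter itself):
    markers are treated as independent, and at each marker the recipient
    copies from a donor chosen uniformly among the other haplotypes that
    carry the same allele.
    """
    n = len(haplotypes)
    n_markers = len(haplotypes[0])
    matrix = [[0.0] * n for _ in range(n)]
    for recipient in range(n):
        for marker in range(n_markers):
            allele = haplotypes[recipient][marker]
            donors = [d for d in range(n)
                      if d != recipient and haplotypes[d][marker] == allele]
            if not donors:
                # No other haplotype carries this allele, so the marker
                # contributes nothing to this recipient's row.
                continue
            share = 1.0 / len(donors)
            for d in donors:
                matrix[recipient][d] += share
    return matrix


# Hypothetical data: four haplotypes typed at four biallelic markers.
haplotypes = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 1],
]
coancestry = toy_coancestry(haplotypes)
for row in coancestry:
    print(row)
```

In this toy data set, haplotypes 0 and 1 receive most of their markers from each other, as do haplotypes 2 and 3. A block pattern of this kind in the matrix is what a clustering method could then pick up.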

Author Summary

The first step in almost every genetic analysis is to establish how sample members are related to each other. High relatedness between individuals can arise if they share a small number of recent ancestors, e.g. if they are distant cousins, or a larger number of more distant ones, e.g. if their ancestors come from the same region. The most popular methods for investigating these relationships analyse successive markers independently, simply adding the information they provide. This works well for studies involving hundreds of markers scattered around the genome but is less appropriate now that entire genomes can be sequenced. We describe a “chromosome painting” approach to characterising shared ancestry that takes into account the fact that DNA is transmitted from generation to generation as a linear molecule in chromosomes. We show that the approach increases resolution relative to previous techniques, allowing differences in ancestry profiles among individuals to be resolved at the finest scales yet. We provide mathematical, statistical, and graphical machinery to exploit this new information and to characterize relationships at continental, regional, local, and family scales.

Most cited references 57

Inference of Population Structure Using Multilocus Genotype Data

Jonathan Pritchard, Matthew Stephens, Peter Donnelly … (2001)

We describe a model-based clustering method for using multilocus genotype data to infer population structure and assign individuals to populations. We assume a model in which there are K populations (where K may be unknown), each of which is characterized by a set of allele frequencies at each locus. Individuals in the sample are assigned (probabilistically) to populations, or jointly to two or more populations if their genotypes indicate that they are admixed. Our model does not assume a particular mutation process, and it can be applied to most of the commonly used genetic markers, provided that they are not closely linked. Applications of our method include demonstrating the presence of population structure, assigning individuals to populations, studying hybrid zones, and identifying migrants and admixed individuals. We show that the method can produce highly accurate assignments using modest numbers of loci—e.g., seven microsatellite loci in an example using genotype data from an endangered bird species. The software used for this article is available from http://www.stats.ox.ac.uk/~pritch/home.html.

Fast model-based estimation of ancestry in unrelated individuals.

David H. Alexander, John Novembre, Kenneth Lange (2009)

Population stratification has long been recognized as a confounding factor in genetic association studies. Estimated ancestries, derived from multi-locus genotype data, can be used to perform a statistical correction for population stratification. One popular technique for estimation of ancestry is the model-based approach embodied by the widely applied program structure. Another approach, implemented in the program EIGENSTRAT, relies on Principal Component Analysis rather than model-based estimation and does not directly deliver admixture fractions. EIGENSTRAT has gained in popularity in part owing to its remarkable speed in comparison to structure. We present a new algorithm and a program, ADMIXTURE, for model-based estimation of ancestry in unrelated individuals. ADMIXTURE adopts the likelihood model embedded in structure. However, ADMIXTURE runs considerably faster, solving problems in minutes that take structure hours. In many of our experiments, we have found that ADMIXTURE is almost as fast as EIGENSTRAT. The runtime improvements of ADMIXTURE rely on a fast block relaxation scheme using sequential quadratic programming for block updates, coupled with a novel quasi-Newton acceleration of convergence. Our algorithm also runs faster and with greater accuracy than the implementation of an Expectation-Maximization (EM) algorithm incorporated in the program FRAPPE. Our simulations show that ADMIXTURE's maximum likelihood estimates of the underlying admixture coefficients and ancestral allele frequencies are as accurate as structure's Bayesian estimates. On real-world data sets, ADMIXTURE's estimates are directly comparable to those from structure and EIGENSTRAT. Taken together, our results show that ADMIXTURE's computational speed opens up the possibility of using a much larger set of markers in model-based ancestry estimation and that its estimates are suitable for use in correcting for population stratification in association studies.

Principal components analysis corrects for stratification in genome-wide association studies.

Alkes L. Price, Nick Patterson, Robert Plenge … (2006)

Population stratification (allele frequency differences between cases and controls due to systematic ancestry differences) can cause spurious associations in disease studies. We describe a method that enables explicit detection and correction of population stratification on a genome-wide scale. Our method uses principal components analysis to explicitly model ancestry differences between cases and controls. The resulting correction is specific to a candidate marker's variation in frequency across ancestral populations, minimizing spurious associations while maximizing power to detect true associations. Our simple, efficient approach can easily be applied to disease studies with hundreds of thousands of markers.

Author and article information

Contributors

: Role: Editor

Journal

Journal ID (nlm-ta): PLoS Genet

Journal ID (publisher-id): plos

Journal ID (pmc): plosgen

Title: PLoS Genetics

Publisher: Public Library of Science (San Francisco, USA)

ISSN (Print): 1553-7390

ISSN (Electronic): 1553-7404

Publication date Collection: January 2012

Publication date (Print): January 2012

Publication date (Electronic): 26 January 2012

Volume: 8

Issue: 1

Electronic Location Identifier: e1002453

Affiliations

[1] Department of Mathematics, University of Bristol, Bristol, United Kingdom

[2] Wellcome Trust Center for Human Genetics, Oxford, United Kingdom

[3] Department of Statistics, University of Oxford, Oxford, United Kingdom

[4] Environmental Research Institute, University College Cork, Cork, Ireland

[5] Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany

The University of North Carolina at Chapel Hill, United States of America

Author notes

* E-mail: daniel_falush@eva.mpg.de

Conceived and designed the experiments: DJL GH SM DF. Analyzed the data: DJL GH SM DF. Wrote the paper: DJL GH SM DF. Implemented CHROMOPAINTER: GH. Implemented fineSTRUCTURE: DJL. Derived propositions: SM.

Article

Publisher ID: PGENETICS-D-11-01471

DOI: 10.1371/journal.pgen.1002453

PMC ID: 3266881

PubMed ID: 22291602

SO-VID: 90f5db60-3317-4988-a14d-ba4f63828300

Copyright © Lawson et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

History

Date received: 13 July 2011

Date accepted: 21 November 2011

Page count

Pages: 16
