      Is Open Access

      Spatial Localization of Recent Ancestors for Admixed Individuals


          Abstract

          Ancestry analysis from genetic data plays a critical role in studies of human disease and evolution. Recent work has introduced explicit models for the geographic distribution of genetic variation and has shown that such explicit models yield superior accuracy in ancestry inference over nonmodel-based methods. Here we extend such work to introduce a method that models admixture between ancestors from multiple sources across a geographic continuum. We devise efficient algorithms based on hidden Markov models to localize on a map the recent ancestors (e.g., grandparents) of admixed individuals, jointly with assigning ancestry at each locus in the genome. We validate our methods by using empirical data from individuals with mixed European ancestry from the Population Reference Sample study and show that our approach is able to localize their recent ancestors within an average of 470 km of the reported locations of their grandparents. Furthermore, simulations from real Population Reference Sample genotype data show that our method attains high accuracy in localizing recent ancestors of admixed individuals in Europe (an average of 550 km from their true location for localization of two ancestries in Europe, four generations ago). We explore the limits of ancestry localization under our approach and find that performance decreases as the number of distinct ancestries and the number of generations since admixture increase. Finally, we build a map of expected localization accuracy across admixed individuals according to the location of origin within Europe of their ancestors.

          Most cited references 22


          Sensitive Detection of Chromosomal Segments of Distinct Ancestry in Admixed Populations

          Introduction

          The identification of chromosomal segments of distinct continental ancestry in admixed populations is an important problem, with a wide range of applications from disease mapping to understanding human history. Early efforts to solve this problem used coarse sets of unlinked markers [1]–[3] and mostly focused on populations such as African Americans [4],[5] and Latinos [6]–[8] that admixed within the past approximately 10 generations. Applying this approach to more anciently admixed populations has led to ancestry predictions that are ambiguous at many loci [9]. However, methods based on coarse sets of markers do not take advantage of the much richer haplotype information available in genome-wide data. More recent methods have been designed to use data from genome-wide scanning arrays [10]–[12], but these methods do not fully model linkage disequilibrium (LD) in the ancestral populations. Thus, they do not capture all of the available information about ancestry, and can be far from optimal. Furthermore, unless a trimming step is applied to remove linked markers [11], unmodeled LD may cause systematic biases in estimated ancestry, leading to false-positive inferences of a deviation in ancestry at certain loci [13]. Here, we describe a haplotype-based method, HAPMIX, which applies an extension of the population genetic model of Li and Stephens [14] to the problem of local ancestry inference in populations formed by two-way admixture. We apply the method to simulated mixtures of African and European chromosomes to show that the resulting local ancestry inference is exceedingly accurate in comparison to other methods, even in the case of ancient admixture in which the shorter ancestry segments are more difficult to infer. As expected from its use of an explicit population genetic model, HAPMIX makes more complete use of dense genome-wide data, producing more accurate results.
We examine the sensitivity of local ancestry inference to a wide array of factors. We also explore the utility of HAPMIX for drawing inferences about both the ancestral populations and the date of admixture. We apply HAPMIX to 935 African American individuals genotyped at ∼650,000 markers. By studying a large set of individuals from an admixed population of high relevance to disease mapping, we validate the effectiveness of this method in a practical setting and specifically show that the ancestry estimates are not systematically biased within the limits of our resolution. To illustrate how the method can provide insights into the history of an anciently admixed population, we also apply HAPMIX to a data set of 29 individuals from the Mozabite population of northern Africa that were genotyped at ∼650,000 markers as part of the Human Genome Diversity Panel (HGDP) [15]. We show that the Mozabite have inherited roughly 78% ancestry from a European-related population and 22% ancestry from a population related to sub-Saharan Africans. Our analysis also shows that the Mozabite admixture has occurred over a period that began at least 100 generations ago (∼2,800 years ago), and that has continued into the present day. We are able to infer small, ancient, ancestry segments in the Mozabite, and we demonstrate that the segments show considerable drift relative to all the other HGDP populations, consistent with the historical isolation of the Mozabite population.

Materials and Methods

Ethics statement

For the African American data, informed consent was obtained from each study participant, and the study protocol was approved by the institutional review board at either the Johns Hopkins University or Howard University.

Overview of haplotype-based inference of local ancestry

HAPMIX assumes that the admixed population being analyzed has arisen from the admixture of two ancestral populations, and that phased data are available from unadmixed reference populations that are closely related to the true ancestral populations (e.g. phased data from HapMap [16]). In theory, discrepancies between the reference populations and the true ancestral populations may lead to inaccuracies, but in practice HAPMIX is robust to this concern under a variety of realistic scenarios (see below). The central idea of the method is to view haplotypes of each admixed individual as being sampled from the reference populations: for example, haplotypes of an African American individual could be sampled from phased African and European chromosomes from HapMap. At each position in the genome, HAPMIX estimates the likelihood that a haplotype from an admixed individual is a better statistical match to one reference population or the other. A Hidden Markov Model (HMM) is used to combine these likelihoods with information from neighboring loci, to provide a probabilistic estimate of ancestry at each locus. The method allows transitions at two scales. The small-scale transitions are between haplotypes from within a reference population, typically at a scale of every few tens of thousands of bases [14]. The large-scale transitions are between the reference populations, at a scale of up to tens of millions of bases for a recently admixed population such as African Americans. Figure 1 illustrates the method schematically.

Figure 1 (doi:10.1371/journal.pgen.1000519.g001). Schematic of the Markov model we use for ancestry inference. The black lower line represents a chromosomal segment from an admixed individual, carrying a number of typed mutations (black circles). The underlying ancestry is shown in the bottom color bar, and reveals an ancestry change from the first population (red) to the second population (blue).
The admixed chromosome is modeled as a mosaic of segments of DNA from two sets of individuals drawn from different reference populations (red and blue horizontal lines respectively) closely related to the progenitor populations for the admixture event. The yellow line shows how the admixed chromosome is constructed in terms of this mosaic. The dotted line above the bottom color bar shows the reference population being copied from along the chromosome – note that at most positions, this is identical to the true underlying ancestry, but with occasional “miscopying” from the other population (blue dotted segment occurring within red ancestry segment). Note also that switches between chromosomes being copied from, representing historical recombinations, are rapid (6 switches), while ancestry changes, representing recombination since admixture, are much rarer (1 switch). Finally, note that at most positions the type of the admixed chromosome is identical to that of the chromosome being copied from, but an exception to this occurs at one site, shown as a grey circle, and representing mutation or genotyping error. In our inference framework, we observe only the variation data for the admixed and reference individuals: the yellow line, and the underlying ancestry, must be inferred as the hidden states in a HMM. An important strength of HAPMIX is the way it analyzes diploid data from admixed individuals. A naïve way to use population genetic methods to infer ancestry would be to pre-process such a data set using phasing software, and then to assume that this guess about the underlying phased haplotype is correct. However, phase switch errors that arise from this procedure (which are common even with the best phasing algorithms [17],[18]) would inappropriately force the method to infer ancestry transitions. HAPMIX circumvents this problem by not assuming that any one haplotype phase solution is correct. 
Instead, it uses a built-in phasing algorithm, similar to that of [17], which allows it to average inferences about ancestry over all possible phase solutions within each admixed individual. We treat the reference populations as fully phased, partly because in some cases, e.g. African and European chromosomes from HapMap, this phasing uses unambiguous trio information and is therefore highly accurate. More importantly, we expect our approach to be robust to errors in phasing in the reference populations, because these are unlikely to force inappropriate ancestry switches, in contrast to phasing errors in the admixed data itself. HAPMIX is also notable in inferring probabilities for whether an individual has 0, 1, or 2 alleles of a particular ancestry at each locus. As our simulations show, these estimates are well-calibrated. Thus, when the method generates a probability p for an individual being heterozygous for ancestry at a locus, they are in fact heterozygous approximately this proportion of the time. A well-calibrated probability of ancestry at each locus is important for a variety of applications, and also allows us to evaluate the robustness of the results. HAPMIX is fundamentally different from existing methods such as ANCESTRYMAP and LAMP [1],[11]. ANCESTRYMAP applies a Hidden Markov Model to unlinked SNPs to model ancestry transitions, while LAMP computes a majority vote of ancestry information using windows of unlinked SNPs, but neither of those methods makes use of haplotype information. Another method for investigating admixture segments, HAPAA, has recently been published [19]. In common with HAPMIX, the HAPAA software uses a Hidden Markov Model to model linkage disequilibrium within populations, and infers ancestry segments. However, there are also a number of important differences between our model and that used by HAPAA. 
First, unlike HAPAA, we allow for some rate of miscopying of ancestry segments from the “wrong” population, which we have found greatly improves our ancestry estimation (instead of this, the HAPAA software uses a post-hoc “filtering” of inferred segments, which removes all segments of size below a certain minimum threshold). Second, we fully allow for unphased data in our model, while the HAPAA approach requires a prior phasing of the data, and then attempts to account for the effect of phase-flip errors on ancestry inference via a heuristic procedure. We believe that these features of HAPMIX are likely to be critical in unraveling older admixture events, where ancestry segments are much shorter. A final advantage of HAPMIX over HAPAA is that it is designed to produce accurate estimates of uncertainty in inferred segments, even for old admixture events.

Details of haplotype-based inference of local ancestry

Modeling genetic variation in admixed populations

Our approach to inferring ancestry segments, implemented in HAPMIX, is based on extending a Hidden Markov Model (HMM) previously developed by Li and Stephens to model linkage disequilibrium in population genetic data [14]. This model has been employed in recent years in various population genetic and disease mapping settings [20],[21]. Informally, given a previous collection of “parental” haplotypes from a reference population, a new “offspring” haplotype drawn from the same population is modeled as a mosaic of these existing haplotypes. This offers a flexible means to account for local linkage disequilibrium (LD), because over short distances, the haplotype that an individual chromosome copies from is unlikely to change. We extend the Li and Stephens model to allow inference on ancestry segments for individuals drawn from an admixed population. We begin by supposing that we have two previously sampled collections of phased haplotypes, P 1 and P 2, taken from two reference populations.
For example, HapMap provides phased haplotypes from the CEU, YRI and JPT+CHB populations genotyped at over 3 million markers [16]. We further assume that P 1 and P 2 have valid data at all sites of interest, with no missing data. In practice, small amounts of missing data in the reference populations can be filled in by a pre-processing imputation step, as has been done for the publicly available phased HapMap data. We label P 1 and P 2 as “parental” haplotypes. Next, we sample a new “offspring” haplotype from an admixed population. We assume that this population is created from a single admixture event between two populations which are genetically similar to the two reference populations from which P 1 and P 2 are drawn. (The reference populations do not need to exactly match the true ancestral populations, because we allow for some genetic divergence in our approach.) We will initially consider the case where we have haploid chromosomes from the admixed population, and subsequently generalize to the more typical case involving unphased genotype data from the admixed population. Throughout this section, we operate in units of genetic (not physical) distance. We begin by modeling the ancestry segments. Assume the admixture event occurred at a single time T generations ago, with a fraction μ 1 of the haplotype's ancestry drawn from population 1, and μ 2 = 1−μ 1 from population 2. Because recombination occurs at each generation, it is natural to model ancestry switches as a Poisson process along the genome [22], at a rate T per unit of genetic distance (i.e. T per Morgan). Conditional on the positions of such switches, each segment is independently drawn from population 1 or 2 with probabilities μ 1, μ 2 respectively. In particular, this implies that not all ancestry switch points will actually change the underlying ancestry. This model has been previously used by other authors [1],[22]. 
Since ancestry cannot be directly observed, it is natural to view underlying ancestry status as the “hidden” information in an HMM. Our approach probabilistically infers this hidden state at each position along a chromosome. To fully specify our model, we must consider the structure of variation conditional on these admixture segments. Our model remains computationally tractable while accommodating important features typical of real data such as mutation, recombination, genotyping error, reference populations that are drifted from the true ancestral populations, and incomplete sampling of diversity in the reference populations reflected in the samples drawn from these populations. We assume that all mutant sites take the form of single nucleotide polymorphisms (SNPs) with two alleles that can be represented as 0 and 1 (however, our approach could be extended to more complex mutation models). We suppose that sections of the genome with true ancestry from population 1 are formed as mosaics of the haplotypes in the two parental groups. Specifically, at any given position with this ancestry, an individual from P 1 is copied with probability 1 − p 1, and an individual from population P 2 is copied with probability p 1 (we call this the “miscopying” parameter for population 1). Conditional on the parental group chosen, individuals to copy from are chosen uniformly from the n 1, n 2 respective individuals in that group. Switches between individuals occur as a Poisson process with rate ρ 1, the “recombination” parameter, and at each switch point a new copy individual is chosen randomly using the above scheme. Finally, at genotyped SNPs, if the “offspring” copies a “parent” from population 1, the offspring carries an identical type to the particular parent it copies from with probability (1−θ 1), and carries the other type with probability θ 1, the “mutation” parameter. If the offspring instead copies an individual from the other population 2, the corresponding mutation parameter is θ 3.
In total this approach leads to 4 additional parameters: p 1, ρ 1, θ 1 and θ 3. For sections of the genome with ancestry from population 2, we formulate our model in an analogous way, with corresponding parameters p 2, ρ 2, θ 2 and θ 3. We note that θ 3 is shared for both populations, a choice that is motivated by a genealogical argument, and has the aim of keeping the total number of parameters manageable. In total, our model has 9 independent parameters: T, μ 1, p 1, p 2, ρ 1, ρ 2, θ 1, θ 2 and θ 3. Some additional remarks about the interpretation of these parameters may be useful. As in the original Li and Stephens implementation, ρ 1 and ρ 2 relate to historical recombination parameters. In our parameterization, these parameters depend on both the effective population sizes of the relevant populations, and the sample sizes n 1 and n 2 drawn from these populations. Although they are not merely a simple function of these quantities, informal coalescent-based arguments suggest that they will decrease roughly linearly with n 1 and n 2, and increase roughly linearly with the effective population sizes of the reference populations [14]. In general, because the amount of historical recombination depends on effective population size, we do not expect ρ 1 = ρ 2, even if n 1 = n 2. The mutation parameters θ 1, θ 2 and θ 3 allow for both historical mutation and genotyping error. The miscopying parameters p 1 and p 2 allow similar “fuzziness” in the group copied from within ancestry segments. If p 1 = p 2 = 0, ancestry segments corresponding to population 1 must copy individuals from population 1, and similarly for population 2. However, setting these parameters equal to zero is likely to lead to spurious ancestry breaks, and therefore misestimation of ancestry segments, for at least two reasons. First, because we only sample a finite number of parental chromosomes, incomplete lineage sorting can occur.
In some parts of the genome, the offspring chromosome is expected to have a deep coalescence time with the ancestors of the “correct” parental sample, and may instead coalesce first with an ancestor of the other parental sample – and therefore choose a descendant of this ancestor, in the “wrong” parental sample, to copy from. Second, if our reference populations are somewhat inaccurate relative to the true ancestral populations, again it is likely that incomplete lineage sorting will occur, even if our “parental” samples are both large. For these reasons, in practice we believe that incorporating non-zero miscopying parameters is important, and in both real data and simulation we find that it greatly improves our ancestry estimation procedure. Because our miscopying parameter is designed to allow for regions in the genome where the offspring chromosome has an unusually deep coalescence time with the other sample members, allowing the “miscopying” to occur, miscopied regions are likely to have unusually deep genealogies. Therefore, we allow a different mutation rate θ 3 for such segments, which is typically expected to be higher than θ 1 or θ 2. It might also be desirable to allow a higher recombination rate in such cases. However, this would result in computational complexities, and we have chosen not to allow such an additional parameter. For a typical application of HAPMIX, we expect to have data from a collection of discrete typed sites. Suppose we have S such sites, and in addition a map giving the genetic distances r1 ,r2 ,…r(S−1) between adjacent pairs of sites. In practice, we interpolate these distances from the genome-wide recombination rates estimated using Phase II HapMap [16]. Given the above parameters, and for a haploid admixed chromosome, we formalize the transition probabilities as follows. 
A (hidden) state for position s is represented by a triplet (i,j,k), where i = 1 or 2 represents ancestry drawn from population 1 or population 2, j = 1 or 2 records the population the chromosome copies from at position s (j may differ from i due to miscopying), and k represents the individual from which the chromosomal segment is copied. There are $2(n_1+n_2)$ possible states. Let $q_s(i,j,k;\,l,m,n)$ be the probability of transitioning from state (i,j,k) to state (l,m,n) between adjacent sites s and (s+1), which are separated by genetic distance $r_s$. Then we have the following:

$$q_s(i,j,k;\,l,m,n) = \left(1-e^{-r_s T}\right)\mu_l\,\frac{\pi_l(m)}{n_m} + \mathbb{1}[l=i]\,e^{-r_s T}\left(1-e^{-r_s \rho_i}\right)\frac{\pi_i(m)}{n_m} + \mathbb{1}[l=i,\,m=j,\,n=k]\,e^{-r_s T}e^{-r_s \rho_i} \tag{0.1}$$

where $\mathbb{1}[\cdot]$ is the indicator function, $n_m$ is the number of individuals in reference population m, and $\pi_l(m)$ is the probability of copying from population m within a segment of ancestry l, that is, $\pi_l(m) = 1-p_l$ if m = l and $\pi_l(m) = p_l$ otherwise. The first term covers an ancestry switch point between the two sites, the second a switch of copied individual without an ancestry switch point, and the third no switch at all.

Conditional on the underlying hidden state, let $e_s(i,j,k)$ denote the probability of the offspring chromosome being of type 1 at site s, and tjk be the type of parental individual k in reference population j. Then

$$e_s(i,j,k) = \begin{cases} \theta_i, & t_{jk}=0,\ i=j \\ 1-\theta_i, & t_{jk}=1,\ i=j \\ \theta_3, & t_{jk}=0,\ i\neq j \\ 1-\theta_3, & t_{jk}=1,\ i\neq j \end{cases} \tag{0.2}$$

This probability allows us to calculate the likelihood of the observed data in the offspring for each possible underlying state. At sites with missing data in the offspring chromosome, the appropriate likelihood contribution is simply 1.0.

Choices of parameter settings

Choices of T and μ 1 are specific to each application (see below). However, many of the remaining parameters were fixed in all analyses of both simulated and real data. As discussed above, it is natural to scale ρ 1 and ρ 2, as well as θ 1 and θ 2, by the numbers of parental individuals n1 , n2 , respectively. Our code is parameterized so this is done internally. Arbitrarily labeling the European population as ancestral population 1, we used recombination parameters ρ 1 = 60,000/n1 per Morgan for the European ancestral population and ρ 2 = 90,000/n2 per Morgan for the African ancestral population (with ρ 2>ρ 1 reflecting the larger effective population size of Africans). Further, we set θ 1 = 0.2/(0.2+n1 ) and θ 2 = 0.2/(0.2+n2 ), and θ 3 = 0.01 (this parameter remains unscaled). Finally, we used miscopying parameters p 1 = p 2 = 0.05. These values were arrived at via a process of trial and error, based on the results of inferring parameters via the EM algorithm.
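The haploid model described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the HAPMIX implementation: the function names, the dictionary-based parameter layout, the toy reference panel sizes (4 and 6 haplotypes), and the example values of T, μ 1 and the inter-site distance are all hypothetical. The transition function follows the verbal description (ancestry switch points at rate T, copy switches at rate ρ within an ancestry segment, miscopying with probability p), and the final lines check that the transition probabilities out of one state sum to 1.

```python
import math

def copy_prob(ancestry, pop, p):
    """Probability of copying from population `pop` inside a segment of the given ancestry."""
    return 1.0 - p[ancestry] if pop == ancestry else p[ancestry]

def transition(src, dst, r, T, mu, p, rho, n):
    """Transition probability between hidden states over genetic distance r (Morgans).

    A state is (ancestry, copied population, copied individual).
    """
    i, j, k = src
    l, m, h = dst  # h indexes the copied individual (avoids clashing with panel sizes n)
    no_anc_switch = math.exp(-r * T)
    # Ancestry switch point: ancestry, copied population and individual are all redrawn.
    prob = (1.0 - no_anc_switch) * mu[l] * copy_prob(l, m, p) / n[m]
    if l == i:
        no_copy_switch = math.exp(-r * rho[i])
        # No ancestry switch point, but the copied haplotype is redrawn.
        prob += no_anc_switch * (1.0 - no_copy_switch) * copy_prob(i, m, p) / n[m]
        if (m, h) == (j, k):
            # No switch of any kind: the state is unchanged.
            prob += no_anc_switch * no_copy_switch
    return prob

def emission(state, parent_allele, theta, theta3):
    """Probability that the offspring carries allele 1, given the copied parent's allele."""
    i, j, _ = state
    mismatch = theta[i] if i == j else theta3
    return 1.0 - mismatch if parent_allele == 1 else mismatch

# Fixed parameter values quoted in the text, scaled by hypothetical panel sizes.
n = {1: 4, 2: 6}                                   # toy numbers of reference haplotypes
rho = {1: 60000 / n[1], 2: 90000 / n[2]}           # per Morgan
theta = {1: 0.2 / (0.2 + n[1]), 2: 0.2 / (0.2 + n[2])}
theta3 = 0.01
p = {1: 0.05, 2: 0.05}
mu = {1: 0.2, 2: 0.8}                              # example admixture proportions (assumed)
T = 6                                              # example generations since admixture (assumed)

states = [(i, j, k) for i in (1, 2) for j in (1, 2) for k in range(n[j])]
row_sum = sum(transition(states[0], d, 1e-5, T, mu, p, rho, n) for d in states)
print(len(states), row_sum)
```

The state list has 2(n1+n2) entries, matching the state count given in the text, and each row of the transition matrix is a proper probability distribution.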
We have implemented an EM algorithm approach to parameter estimation that can infer any subset of the HAPMIX input parameters, or all simultaneously (see Text S2). This EM approach to parameter inference is currently only implemented for haploid data from the admixed population, but we applied it to haploid data derived from a phasing of diploid data, obtained by running HAPMIX on diploid admixed samples and using the software to sample random state paths. This approach might be applied to diploid samples more generally, and could potentially be iterated, by updating phasing based on new parameter sets. However, based on our simulations we believe that for many applications (for example, whenever the software is applied to African American data) it will be sufficient to vary T and μ 1 and fix the remaining parameters at the values described above.

Inferring probabilistic ancestry segments and sampling from the posterior with HAPMIX

It is easy to see that equations (0.1) and (0.2) describe a HMM for the underlying state (which includes information on ancestry) as we move along the genome, and that the underlying Markov process is reversible. Given a set of parameters, we can exploit these properties: HAPMIX implements standard HMM techniques to efficiently infer posterior probabilities of underlying states, via the forward-backward algorithm, or sample random state paths from the correct joint posterior distribution, using a standard modification of this algorithm. In addition to parameter values, the software takes as input a recombination map for the regions to be analyzed, phased “parental” chromosomes from the two reference populations, and “offspring” data from the admixed population being analyzed. A naïve implementation of the forward/backward algorithm would require computation time proportional to $4S(n_1+n_2)^2$, in the above notation.
For the original Li and Stephens model, it is possible to reduce computation time substantially by using the fact that many pairs of transition probabilities between states are identical, which allows terms to be collapsed in the forward (or backward) algorithm, into expressions involving a single term that is shared among all destination states. Calculating this shared term just once per pair of adjacent sites, and then storing it, saves substantial computational effort [14]. Analogously, in our somewhat more complicated setting we can exploit a similar phenomenon, so that by calculating and storing a somewhat larger number of shared terms (one for each group of states of the form (i,j), giving four in total) HAPMIX can complete the forward/backward algorithm in time proportional to 2S(n1 +n2 ) (with an additional scaling constant). It is straightforward to extend our approach to allow imputation of missing data, while simultaneously labeling underlying ancestry, in an analogous manner to methods employed in several existing approaches to imputation for samples drawn from panmictic populations [20],[21]. We will describe this extension, and its application to disease mapping, in a separate paper.

Multiple individuals from the admixed population

Typically, we actually have multiple “offspring” samples (either haploid chromosomes or diploid genotypes, see below) from the admixed population of interest. For the analyses in this paper, we used HAPMIX to analyze data from each sample independently, using the same parental chromosomes in each case. Although in principle improvements to ancestry inference could result from considering the problem in multiple samples jointly, there are formidable computational challenges in adapting our approach to allow this (one possibility might be to employ MCMC, as used for unlinked sites [22],[23]). To avoid these complications, we simply model each admixed sample independently, following [21].
Under this scheme, separate HAPMIX runs for each sample enable effective parallelization of the software.

Diploid genotype data from the admixed population

Typically, real data consist of unphased genotypes for individuals drawn from a population, with haplotypic phase unknown. Many approaches already exist to infer phase from such data [17],[18]. However, phase switch errors that inevitably result from applying such algorithms are likely to result in spurious ancestry switches within regions of the genome where an individual is heterozygous for ancestry. This would likely lead to considerable overestimation of the time since admixture and a reduction in the accuracy of ancestral inference. To avoid such issues, we have extended our approach to directly analyze diploid genotype data from the admixed population. The phasing is implemented using a HMM adapted from that described above (0.1) and employing a composite hidden state at each location, of the form (i1 ,j1 ,k1 ,i2 ,j2 ,k2 ) where (i1 ,j1 ,k1 ) represents the previously defined “haploid” hidden state for the first chromosome, and (i2 ,j2 ,k2 ) represents the hidden state for the second chromosome. The state space therefore now has dimension $4(n_1+n_2)^2$. Allowing independent transitions between the marginal states for each chromosome, the terms in (0.1) now naturally define an HMM for these composite states (for reasons of space, we do not explicitly list all of the transition probabilities in the model here). This model could have up to 18 parameters. In our implementation, for natural biological reasons we assume all parameters are shared between chromosomes, apart from time since admixture T and admixture proportion μ 1, resulting in 11 parameters in total. Further, although our software allows these two parameters to differ, in all applications considered here we specify T and μ 1 to be the same for each chromosome. Emission probabilities are also adapted from the haploid case.
For genotype data, there are 3 possible emissions at typed sites, which we denote as genotypes g = 0, 1, or 2, with g counting copies of the “1” allele. Conditional on the underlying hidden state, let $e_s(g \mid i,j,k,l,m,n)$ denote the probability of observing genotype g given underlying state (i,j,k,l,m,n), and define tjk as before to be the type of parental individual k in reference population j. Then, using (0.2),

$$e_s(g \mid i,j,k,l,m,n) = \begin{cases} \left(1-e_s(i,j,k)\right)\left(1-e_s(l,m,n)\right), & g=0 \\ e_s(i,j,k)\left(1-e_s(l,m,n)\right) + \left(1-e_s(i,j,k)\right)e_s(l,m,n), & g=1 \\ e_s(i,j,k)\,e_s(l,m,n), & g=2 \end{cases} \tag{0.3}$$

where $e_s(i,j,k)$ and $e_s(l,m,n)$ are the probabilities that the first and second chromosomes, respectively, carry the “1” allele at site s, as defined in (0.2). Having defined the HMM for this setting, we again use standard techniques to obtain posterior probabilities on (joint) ancestry for the two chromosomes, and then sample states from this posterior distribution. We note that as a by-product of sampling complete states jointly for the two chromosomes together, we are phasing the original data with respect to the underlying ancestry. This may help reduce phasing error rates in admixed populations compared to methods that ignore local ancestry, although we do not pursue this issue here. We can adapt the computational speedups described above to the diploid setting, so that while a naïve implementation of the forward algorithm would take time proportional to $16S(n_1+n_2)^4$, we can complete the forward/backward algorithm in time proportional to $4S(n_1+n_2)^2$. A further speedup for the diploid setting is described in Text S2. With these speedups implemented, the running time of HAPMIX is roughly 30 minutes on a single processor per diploid genome analyzed (519,248 sites). Because the computations can be parallelized across admixed individuals (they can also be parallelized across chromosomes), HAPMIX is computationally tractable even for very large data sets if a cluster of computing nodes is available. For example, the running time for a data set of 1,000 admixed individuals on a cluster of 100 nodes is roughly 5 hours.
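As a small illustration of the diploid emission step, the sketch below combines two per-chromosome probabilities of carrying the “1” allele into probabilities for the unphased genotype g = 0, 1 or 2, assuming the two chromosomes emit independently given the composite hidden state. The function names are hypothetical, and the second helper simply evaluates the composite state-space size quoted in the text.

```python
def genotype_emission(g, e1, e2):
    """Probability of unphased genotype g, given per-chromosome allele-1 probabilities.

    e1 and e2 are the probabilities that chromosome 1 and chromosome 2 carry
    allele 1 under their respective haploid hidden states.
    """
    if g == 0:
        return (1.0 - e1) * (1.0 - e2)
    if g == 1:
        # Either chromosome may carry the single copy of allele 1.
        return e1 * (1.0 - e2) + (1.0 - e1) * e2
    if g == 2:
        return e1 * e2
    raise ValueError("genotype must be 0, 1 or 2")

def diploid_state_count(n1, n2):
    """Number of composite hidden states for reference panels of n1 and n2 haplotypes."""
    haploid_states = 2 * (n1 + n2)
    return haploid_states ** 2

# Example: one chromosome very likely carries allele 1, the other very likely does not.
probs = [genotype_emission(g, 0.95, 0.01) for g in (0, 1, 2)]
print(probs, diploid_state_count(120, 120))
```

The three genotype probabilities always sum to 1, which is a quick sanity check on any implementation of this step.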

Measuring the performance of HAPMIX

Estimate of $r^2$ between predicted and true ancestry

Irrespective of whether the true ancestry is known (as in simulations) or unknown (as in real data), an estimate of the $r^2$ between a predicted ancestry vector Y and true ancestry X can be computed. Within an individual, at each site s, a natural measure of predicted ancestry is the expected number Y s of haplotypes from one of the two source populations. If HAPMIX provides accurate ancestry probabilities, the true number of haplotypes from this population, X s , can be thought of as an unknown random variable which is equal to 0, 1, or 2 with probabilities p 0, p 1, p 2 specified by the ancestry predictions. We are interested in how correlated the predicted ancestry Y and true ancestry X are, over samplings from this distribution of the true ancestry X. A natural way to estimate this correlation is to calculate the expected squared correlation between X and Y, which we may approximate using a ratio of means:

$$E\left[r^2(X,Y)\right] \approx \frac{\left(E\left[\mathrm{cov}(X,Y)\right]\right)^2}{E\left[\mathrm{var}(X)\right]\,\mathrm{var}(Y)}$$

where the variances and covariances are taken over loci and individuals, and the expectations over samplings of the ancestry X. The expected covariance between predicted and true ancestry is then the mean value of the covariance between X and Y as we sample ancestry paths at different loci and in different individuals. At a single locus s, we have $E(X_s Y_s) = (p_1+2p_2)^2$ and $E(X_s) = E(Y_s) = p_1+2p_2$. By separately averaging these three expectations across loci and individuals, we can then calculate $E[\mathrm{cov}(X,Y)]$ analytically. We can calculate the variance of Y, and the expected variance of X, across loci and different individuals in a similar way. Combining these variances with the covariance to estimate correlation, and then squaring, we obtain a measure of the level of certainty of the ancestry predictions.
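The estimate described above can be computed from the posterior ancestry probabilities alone. The following sketch is one way to do so, under two assumptions that are not stated in the text: the input is a flat list of (p0, p1, p2) triples pooled over loci and individuals, and the sampling variance of the mean of X is ignored when forming the expected variance of X (in keeping with the ratio-of-means approximation). The function name is hypothetical.

```python
def estimated_r2(probs):
    """Estimate r^2 between predicted and true ancestry from posterior probabilities.

    probs: list of (p0, p1, p2) triples, one per locus and individual, giving the
    posterior probability of carrying 0, 1 or 2 haplotypes from one source population.
    """
    m = len(probs)
    y = [p1 + 2.0 * p2 for _, p1, p2 in probs]        # predicted ancestry Y_s
    mean_y = sum(y) / m                               # also the mean of E(X_s)
    mean_xy = sum(v * v for v in y) / m               # mean of E(X_s Y_s) = Y_s^2
    mean_x2 = sum(p1 + 4.0 * p2 for _, p1, p2 in probs) / m   # mean of E(X_s^2)

    exp_cov = mean_xy - mean_y * mean_y               # expected covariance of X and Y
    var_y = sum(v * v for v in y) / m - mean_y * mean_y
    exp_var_x = mean_x2 - mean_y * mean_y             # expected variance of X (approximate)

    if var_y == 0.0 or exp_var_x == 0.0:
        return float("nan")                           # correlation undefined for constant input
    return exp_cov * exp_cov / (exp_var_x * var_y)

# Confident calls give a value near 1; less certain calls give a lower value.
confident = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 0.0)]
uncertain = [(0.9, 0.1, 0.0), (0.0, 0.1, 0.9)]
print(estimated_r2(confident), estimated_r2(uncertain))
```

Under these assumptions the estimate reduces to the ratio of the variance of Y to the expected variance of X, so it falls as the posterior probabilities become less decisive.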
Actual r 2 between predicted and true ancestry In simulated data sets where the true ancestry is known, the estimated r 2 between predicted and true ancestry (which is computed using ancestry predictions only) can be compared to the actual r 2 between these quantities (comparing ancestry predictions to true ancestries specified in simulations). As we confirm in what follows, the estimates of r 2 are well calibrated. Simulations Simulations of local ancestry inference We simulated individuals of admixed African and European ancestry by constructing their genomes from a mosaic of real Yoruba and French individuals genotyped on the Illumina 650Y chip as part of the Human Genome Diversity Panel (HGDP) [15]. We downloaded data from 20 Yoruba and 20 French individuals from the HGDP data set and jointly phased them using the fastPHASE program [18] to form 40 haploid Yoruba and 40 haploid French genomes. We constructed 40 haploid admixed genomes (n = 1 to 40) from the 40 haploid Yoruba and 40 haploid French genomes by using haploid Yoruba genome n and haploid French genome n to construct admixed genome n, so that ancestral genomes were never reused. To construct an admixed genome, we began at the first marker on each chromosome and sampled French ancestry with probability α and Yoruba ancestry with probability 1-α. Ancestry was resampled based on an exponential distribution with weight λ (the number of generations since admixture) so that a new ancestry was sampled with probability 1−e−λg when traversing a genetic distance of g Morgans. Each time ancestry was resampled, we sampled French ancestry with probability α and Yoruba ancestry with probability 1-α. For each individual, we used a value of α to apply to the entire genome by sampling from a beta distribution with mean 0.20 and standard deviation 0.10 (typical for African Americans [4]). We simulated values of λ = 6 (typical for African Americans [4]) as well as higher values of λ: 10, 20, 40, 60, 100, 200 and 400. 
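The mosaic construction described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' simulation code: the function names are hypothetical, a single chromosome is simulated, and marker positions are supplied in Morgans. The beta shape parameters are derived from the stated mean and standard deviation by moment matching.

```python
import math
import random


def beta_params(mean, sd):
    """Shape parameters (a, b) of a beta distribution with the given mean and sd."""
    common = mean * (1.0 - mean) / sd ** 2 - 1.0
    return mean * common, (1.0 - mean) * common


def simulate_ancestry(positions, alpha, lam, rng):
    """Ancestry label per marker; positions are ascending genetic positions in Morgans."""
    def draw():
        return "French" if rng.random() < alpha else "Yoruba"

    labels = [draw()]                      # ancestry at the first marker
    for prev, cur in zip(positions, positions[1:]):
        g = cur - prev                     # genetic distance traversed
        if rng.random() < 1.0 - math.exp(-lam * g):
            labels.append(draw())          # resample; may return the same ancestry
        else:
            labels.append(labels[-1])
    return labels


def mosaic(yoruba_hap, french_hap, labels):
    """Copy each allele from the haplotype named by the ancestry label."""
    return [f if lab == "French" else y
            for y, f, lab in zip(yoruba_hap, french_hap, labels)]


rng = random.Random(1)
a, b = beta_params(0.20, 0.10)             # mean and sd quoted in the text
alpha = rng.betavariate(a, b)              # one genome-wide alpha per individual
positions = [i * 0.001 for i in range(1000)]
labels = simulate_ancestry(positions, alpha, lam=6, rng=rng)
```

Because a resampling event can draw the same ancestry again, only a fraction of resampling events produce a visible ancestry switch.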
Pairs of haploid admixed individuals were merged to form 20 diploid admixed individuals. It is important to distinguish between the true ancestry proportion α in a simulated or real admixed individual and the parameter μ 1 used as input to HAPMIX, which may differ from α (if α is unknown). Similarly, it is important to distinguish between the true number λ of generations since admixture and the parameter T used as input to HAPMIX. Below we explore the consequences of inaccurately specifying the parameters μ 1 and T. The reference populations used as input to HAPMIX consisted of 60 YRI individuals (120 haploid chromosomes) and 60 CEU individuals (120 haploid chromosomes) from the International HapMap Project [16]. A joint analysis of HGDP and HapMap data indicated that F ST(Yoruba,YRI) = 0.000 and F ST(French,CEU) = 0.001, so that the reference populations used as input to HAPMIX were extremely accurate. All HAPMIX simulations were restricted to 519,248 autosomal markers present in HGDP data which were polymorphic in phased YRI and phased CEU data from HapMap. For comparison purposes, we ran the ANCESTRYMAP, and LAMP-ANC programs on the same simulated data sets, making use of diploid YRI and CEU genotype data from HapMap and restricting all input data to subsets of markers that were unlinked in the reference populations, as recommended by those methods [1],[11]. For the ANCESTRYMAP runs, we chose a subset of 20 ancestry segments, we note that the genetic map used as input to the software has total length 35.5 Morgans. For an individual with admixture proportion α, we expect to observe a fraction 2α(1-α) of all recombination events occurring since admixture (i.e. those that result in a change in ancestry). Given λ generations since admixture, we therefore expect to see a total of 142 λ α(1-α) events in a diploid individual. 
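The expected count above follows from two haploid copies of a 35.5 Morgan map, λ crossovers per Morgan, and a fraction 2α(1-α) of crossovers that change ancestry. The sketch below checks that arithmetic and inverts it to recover λ from an observed number of transitions. The function names are hypothetical, and the inversion is our reading of the relationship rather than a formula quoted from the text.

```python
GENOME_MORGANS = 35.5  # total genetic map length stated in the text


def expected_transitions(lam, alpha, genome_morgans=GENOME_MORGANS):
    """Expected ancestry transitions in a diploid individual: 142 * lam * alpha * (1 - alpha)."""
    crossovers = 2 * genome_morgans * lam          # two haploid copies
    return crossovers * 2 * alpha * (1 - alpha)    # fraction that change ancestry


def estimate_generations(n_transitions, mu, genome_morgans=GENOME_MORGANS):
    """Moment estimate of lam, using the genome-wide ancestry proportion mu in place of alpha."""
    return n_transitions / (4 * genome_morgans * mu * (1 - mu))


n_expected = expected_transitions(lam=6, alpha=0.20)
lam_hat = estimate_generations(n_expected, mu=0.20)
```

The estimate is undefined when mu is 0 or 1, which corresponds to individuals with no detectable admixture.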
Estimating α using the observed genome-wide ancestry proportion μ for that individual, if N ancestry transitions are observed, then a natural moment estimator of the number of generations since admixture is

$$\hat{\lambda} = \frac{N}{142\,\mu(1-\mu)}$$

We excluded 3 clear outlier individuals who had more than 20 inferred generations of admixture, because we believe this is likely to indicate partial ancestry from a third source population in these individuals.

Analysis of 29 Mozabite samples

We analyzed 29 Mozabite samples from the HGDP data set. A total of 30 Mozabite individuals were originally genotyped as part of the HGDP, but one individual (HGDP01281) was excluded due to cryptic relatedness. We ran HAPMIX on the 29 Mozabite individuals using YRI and CEU as the input reference populations. We inferred the number of generations since admixture that provided the best fit to the data, and computed F_ST values between the inferred ancestral segments and the reference populations (YRI and CEU), as described above for the African American data set.

Analysis of other HGDP populations

We ran HAPMIX on a total of 13 populations from the HGDP data that were of African, European, or Middle Eastern ancestry. For each population, we used YRI and CEU as the input reference populations, and estimated the European-related mixture proportion. For populations with European-related ancestry that was estimated to be more than 0% and less than 100%, we also estimated the number of generations since mixture.

Web resources

The HAPMIX software is available for downloading at the following URL: http://www.stats.ox.ac.uk/~myers/software.html.

Results

Simulations

Simulations of local ancestry inference

We began by examining the performance of HAPMIX in a set of 20 simulated admixed individuals, with an average of 80% African ancestry and 20% European ancestry, and generated with admixture occurring 6 generations ago (λ = 6; see Materials and Methods).
These parameters were chosen to be in the range of typical values for African Americans. We implemented a simulation framework in which admixed individuals were constructed using genotype data from the Human Genome Diversity Project, but modeled using reference populations from HapMap (see Materials and Methods). We compared the local ancestry estimates produced by HAPMIX (probabilities of 0, 1, or 2 copies of European ancestry) to the true values of local ancestry that were simulated. These simulation results suggest that our method is likely to provide near optimal ancestry reconstruction in African Americans: the squared correlation between predicted and true number of European copies (across all samples) was equal to 0.98, and discernment of ancestry transitions was extremely sharp, as seen in a plot of the predicted vs. true number of European copies for an admixed sample on chromosome 1 (Figure 2A). For comparison purposes, we also computed local ancestry estimates using the ANCESTRYMAP and LAMP-ANC programs [1],[11] (see Materials and Methods). (We chose not to explicitly compare HAPMIX to additional recently developed methods such as SABER, LAMP, uSWITCH and uSWITCH-ANC [10]–[12] , because in previous work the LAMP-ANC method—which we do compare HAPMIX to—has been shown to perform approximately as well as each of those methods in a range of scenarios [11].) The squared correlation between predicted and true number of European copies was equal to 0.86 for ANCESTRYMAP, 0.83 for LAMP-ANC and discernment of ancestry transitions was less sharp or sometimes missed entirely (Figure 2A). 10.1371/journal.pgen.1000519.g002 Figure 2 Comparison of ancestry estimates produced by HAPMIX, ANCESTRYMAP, and LAMP-ANC. (A) Results comparison for a simulated recently admixed sample on chromosome 1. On each plot, the y-axis denotes the number of European chromosomal copies predicted by each method. The centromere of the chromosome is blanked out in white. 
The top plot shows the true number of European chromosomes, while the subsequent labeled plots show the results of applying each respective method. (B) Results comparison for a real African American individual across chromosome 1. Plots are constructed as in (A). We note the visible similarity to the simulation results. A more challenging setting for ancestry inference is when admixture occurs further back in time, resulting in smaller ancestry segments. We therefore repeated the above comparisons with increasing lambda (Figure 3). The results show a uniformly better performance by HAPMIX relative to the other two methods, with the comparative advantage of HAPMIX increasing with time since admixture. 10.1371/journal.pgen.1000519.g003 Figure 3 Accuracy of HAPMIX, ANCESTRYMAP, and LAMP-ANC predictions for various values of λ, the number of generations since admixture. For each admixture time, results are based on analyzing 20 admixed individuals, simulated using an average genome-wide proportion of 80% African and 20% European ancestry. For each method, we plot the squared correlation between predicted and true number of European copies as a function of λ. To investigate whether the probabilities of 0, 1, or 2 copies of European ancestry reported by HAPMIX are well-calibrated, we binned the predicted probabilities into bins of size 0.05 and compared, for each x = 0,1,2 and for each bin, the average predicted probability vs. the actual frequency in simulations of having x copies of European ancestry. For example, in the λ = 6 simulation, restricting to instances in which the predicted probability of 1 copy of European ancestry was between 0.05 and 0.10, the average predicted probability of 1 copy of European ancestry was 0.07 and the true frequency of 1 copy of European ancestry was 0.08, which is close to 0.07. More generally, we observed that HAPMIX predictions from our λ = 6 and λ = 100 simulations were well calibrated for each value of x = 0,1,2 (Figure 4). 
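The calibration check described above can be sketched as a simple binning routine. This is a minimal illustration with hypothetical names: `pred_prob` holds the predicted probability of x copies of European ancestry at each individual-site pair, and `truth` flags whether the simulated ancestry really had x copies there.

```python
import numpy as np


def calibration_table(pred_prob, truth, bin_width=0.05):
    """Return (bin_start, mean predicted probability, observed frequency, count) per non-empty bin."""
    pred_prob = np.asarray(pred_prob, dtype=float)
    truth = np.asarray(truth, dtype=bool)
    n_bins = int(round(1.0 / bin_width))
    # Assign each prediction to a bin; a probability of exactly 1.0 goes in the last bin.
    idx = np.minimum((pred_prob / bin_width).astype(int), n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            rows.append((b * bin_width,
                         float(pred_prob[mask].mean()),
                         float(truth[mask].mean()),
                         int(mask.sum())))
    return rows


rows = calibration_table([0.06, 0.08, 0.07, 0.07], [0, 0, 0, 1])
```

For well-calibrated predictions the second and third entries of each row are close, which corresponds to points lying near the line y = x in a calibration plot. Values that fall exactly on a bin edge may land in either neighboring bin because of floating-point rounding.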
The calibration of intermediate bins appears visually worse for the λ = 6 simulation; however, the proportion of the genome that is in the most extreme bins where the method is certain is 98%, 97%, 99%, for x = 0,1,2 in these simulations, and hence the reliability of the probabilities remains good for recently admixed populations too. 10.1371/journal.pgen.1000519.g004 Figure 4 Properties of HAPMIX. (A) For simulated admixed data sets, constructed as described in Materials and Methods using λ = 6 and λ = 100, we plot the r 2 between predicted and true number of European chromosomal copies, as a function of the number of markers genotyped across the genome. (B) The same as part A, except we now fix the number of markers genotyped at 500,000, and vary the number of input chromosomes used to predict ancestry (for full details, see text). (C) Calibration of uncertainty estimates produced by HAPMIX. For the λ = 6 simulations, and for each of x = 0, x = 1, and x = 2 we compare the average probability of x copies of European ancestry predicted by HAPMIX to the true frequency of having x copies of European ancestry, binning the predicted probabilities of x copies of European ancestry into bins of size 0.05. If the method were perfectly calibrated, the results would lie along the line y = x (thin black line). Note that for λ = 6, ancestry is normally inferred with high certainty, and over 98% of data points fall into the most extreme two bins. (D) The same as part A, except using λ = 100. Both the last two plots show reasonable calibration of HAPMIX. We also used the HAPMIX predictions to compute an estimate of the squared correlation between predicted and true #European copies (see Materials and Methods). We obtained estimates of 0.98 for the λ = 6 simulation and 0.83 for the λ = 100 simulation, which are identical to the true r 2 values of 0.98 for λ = 6 and 0.83 for λ = 100, consistent with the finding that HAPMIX predictions are well calibrated. 
Although most of our simulations focused on individuals of mixed African and European ancestry, we also considered a more general set of two-way mixtures of African, European, Chinese and/or Japanese populations. We again observed that HAPMIX outperformed other methods (see Text S1). Furthermore, although HAPMIX is currently implemented assuming only two reference populations, we were able to attain accurate results in a more complex scenario of three-way admixture, by running HAPMIX in a two-way mode using different choices of reference populations (see Text S1). Simulations of local ancestry inference using inaccurate reference populations In many real-world settings, the true reference populations for a particular admixture event may not have had suitable genetic data gathered, or may no longer exist. To test for the effect of this situation on HAPMIX, we repeated our simulations at λ = 6 and λ = 100 using 20 admixed samples that were simulated using Mandenka and Basque individuals but modeled using reference populations YRI and CEU, which are inaccurate reference populations (see Materials and Methods). For λ = 6, the squared correlation between predicted and true #European copies remained high at 0.95, only marginally worse than the 0.98 obtained using accurate reference populations. For λ = 100, the squared correlation was 0.76, again only slightly worse than the 0.83 obtained using accurate reference populations. In short, the effects of these levels of inaccuracy in the reference populations (F ST = 0.01) are relatively small. We also repeated our simulations at λ = 6 and λ = 100 using 20 admixed samples that were simulated using Yoruba and Druze but modeled using reference populations YRI and CEU, (see Materials and Methods). The squared correlation between predicted and true number of European copies was 0.97 at λ = 6 and 0.79 at λ = 100, as compared to 0.98 and 0.83 using accurate reference populations. 
Thus, HAPMIX is robust to rather inaccurate (F ST = 0.02) reference populations, and to the asymmetric case where only one reference population is inaccurate. Simulations of local ancestry inference as a function of data size and parameter settings We investigated how the accuracy of HAPMIX varies with data size, by varying either the number of markers or the number of reference chromosomes, in our λ = 6 and λ = 100 simulations (see Materials and Methods). Accuracy as a function of the number of markers is displayed in Figure 4A, which shows that as few as 50,000 random markers are close to optimal for λ = 6 but that hundreds of thousands of markers are needed to produce optimal results in the more challenging case where λ = 100. Accuracy as a function of the number of reference chromosomes is displayed in Figure 4B, which shows that as few as 40 chromosomes (phased from 20 diploid samples) from each reference population are close to optimal. We also investigated how the accuracy of HAPMIX is affected when the parameters used as input are inaccurately specified (see Materials and Methods). Results of our simulations in which the genome-wide ancestry proportion μ 1 was inaccurately specified (different from the value α used to simulate the data) are displayed in Table 1. We observed that even if μ 1 is very inaccurate (e.g. by a factor of 4), there is no effect on results for λ = 6 and only a minimal effect (which primarily affects the genome-wide average of HAPMIX ancestry estimates, but not their correlation with true ancestry) for λ = 100. Results of our simulations in which the number of generations T since admixture was inaccurately specified (different from the value λ used to simulate the data) are displayed in Table 2. We observed that even if T is very inaccurate (e.g. by a factor of 2 to 5), there is no effect on results for λ = 6 and only a minimal effect for λ = 100. Thus, HAPMIX appears to be extremely robust to parameter misspecification. 
10.1371/journal.pgen.1000519.t001
Table 1. HAPMIX accuracy as a function of ancestry proportion parameter.

μ1      λ = 6 simulated data: r² (α_avg)    λ = 100 simulated data: r² (α_avg)
0.05    0.98 (0.20)                         0.82 (0.18)
0.10    0.98 (0.20)                         0.83 (0.19)
0.20    0.98 (0.20)                         0.83 (0.20)
0.40    0.98 (0.20)                         0.83 (0.21)
0.80    0.98 (0.20)                         0.83 (0.22)

We list both the r² between true and inferred ancestry, and the genome-wide average α_avg of HAPMIX ancestry estimates, as a function of the parameter μ1 used as input to HAPMIX, for data simulated at λ = 6 and λ = 100. Results for HAPMIX runs in which the ancestry proportion was specified correctly are underlined.

10.1371/journal.pgen.1000519.t002
Table 2. HAPMIX accuracy as a function of date of admixture parameter.

T       r² for λ = 6 simulated data    r² for λ = 100 simulated data
2       0.98                           n/a
4       0.98                           n/a
6       0.98                           0.68
8       0.98                           0.72
10      0.98                           0.77
20      0.98                           0.81
40      0.97                           0.83
100     0.94                           0.83
200     n/a                            0.83
400     n/a                            0.80

We list the r² between true and inferred ancestry as a function of the parameter T used as input to HAPMIX, for data simulated at λ = 6 and λ = 100. Results for HAPMIX runs in which the date of admixture was specified correctly are underlined. We did not attempt runs in which T differs from the correct date of admixture by a factor of >20.

Inference of ancestral populations

We are interested in applying HAPMIX to improve our understanding of ancestral populations contributing to admixture events. To explore the usefulness of the software for this purpose, we analyzed segments of inferred African or inferred European ancestry from our λ = 6 and λ = 100 simulations to investigate how closely they corresponded to the true ancestral populations used to simulate admixed individuals (see Materials and Methods). We chose to use F_ST, a commonly applied summary statistic, to quantify differences between the inferred and actual ancestral populations.
In the λ = 6 simulations using Yoruba and French ancestral populations, which closely match the YRI and CEU reference populations, the F_ST values between segments of inferred ancestry and the corresponding ancestral populations were equal to 0.001, indicating a tight correspondence (Table 3). The λ = 100 simulations produced a similarly tight correspondence (Table 3), even though values of local ancestry could only be inferred with moderate accuracy (Figure 3). The correspondence between inferred ancestral segments and true ancestral populations remained reasonably tight even when the true ancestral populations (either Mandenka and Basque, or Yoruba and Druze) were inaccurately modeled by the reference populations (YRI and CEU) used for inference (Table 3). Thus, HAPMIX shows promise for reconstructing ancestral populations that are somewhat different from available reference populations.

10.1371/journal.pgen.1000519.t003
Table 3. Inference of ancestral populations.

trueAFR     trueEUR    λ          F_ST(inferredAFR,trueAFR)    F_ST(inferredEUR,trueEUR)
Yoruba      French     λ = 6      0.001                        0.001
Yoruba      French     λ = 100    0.000                        0.003
Mandenka    Basque     λ = 6      0.000                        0.003
Mandenka    Basque     λ = 100    0.001                        0.003
Yoruba      Druze      λ = 6      0.000                        0.006
Yoruba      Druze      λ = 100    0.001                        0.007

For admixed samples simulated at λ = 6 and λ = 100 from an ancestral African population (trueAFR) and an ancestral European population (trueEUR), we report the value of F_ST between segments of African ancestry (inferredAFR) or European ancestry (inferredEUR) inferred by HAPMIX and the true ancestral populations.

Although the correspondence between inferred ancestral segments and true ancestral populations is reasonably tight, it is not perfect, with F_ST values as large as 0.007 between inferred European segments and the European ancestral population in the Yoruba/Druze simulations (Table 3). Interestingly, the European population with this high F_ST value contributed only 20% of the ancestry on average in our simulations.
We hypothesized that rare erroneous ancestral segments might be having a disproportionate effect on F_ST estimation for this group, particularly at sites where only a few simulated individuals really had ancestry from the Druze, where errors might dominate. Consistent with this idea, when we restricted our analysis to only positions where we inferred at least 5 chromosomes from the European population, results were considerably more accurate (F_ST = 0.004 for λ = 100 and F_ST = 0.003 for λ = 6). Also consistent with this hypothesis, when we repeated the Yoruba/Druze simulations with 50% European ancestry, results were considerably more accurate (0.001 or less for all F_ST values corresponding to Table 3, for both λ = 6 and λ = 100 and for both European and African segments). Thus, although greater potential for inaccuracy exists in the inference of segments of an ancestral population which on average contributes only a small number of chromosomes to the admixed sample, there is hope of increasing accuracy in this context by appropriate filtering of results.

Inference of date of admixture

Our results show that supplying the correct value of the number of generations since admixture to HAPMIX has virtually no impact on the accuracy of inference of local ancestry (Table 2). Nonetheless, inferring the date of admixture remains an important aim for making inferences about history. We tested the effectiveness of HAPMIX in inferring the date of admixture by computing likelihoods at different values of T, using data that were simulated at λ = 6, λ = 20 and λ = 100 (see Materials and Methods). The highest likelihoods were obtained at T = 6, T = 17 and T = 75, respectively, with steep likelihood functions leaving little predicted uncertainty in these estimates. Thus, inference of date of admixture is imperfect, with a moderate bias towards underestimation for larger values of λ, but still potentially useful.
We also tried running HAPMIX to infer the date of admixture using data simulated under a double-admixture scenario (λ = 6∘100) (see Materials and Methods). This data set violates the model assumption of a single admixture event producing an exponential distribution of ancestry segment lengths. In this simulation, the highest likelihood was obtained at T = 45, intermediate between the true admixture times. In the context of multiple admixture events, the HAPMIX date estimate can be loosely interpreted as an estimate of the number of crossover events per unit of genetic distance that have occurred since admixture. We expect this estimate to lie within the time period spanned by the admixture events. Analysis of 935 African American samples We ran HAPMIX on 935 African American samples to obtain local ancestry estimates at each location in the genome (see Materials and Methods). Although the true number of European copies at each locus is unknown, the probabilities produced by HAPMIX provide an estimate of the squared correlation between predicted and true number of European copies (see Materials and Methods). Our estimate was r 2 = 0.98, which implies that HAPMIX can provide close to full power for admixture mapping of disease genes in African Americans. We also ran the ANCESTRYMAP and LAMP-ANC programs on these data [1],[11] (see Materials and Methods). Discernment of ancestry transitions was much sharper for HAPMIX compared to the other methods, as seen in a plot of number of European copies predicted by each method for an African American sample on chromosome 1 (Figure 2B). This is expected from our results on simulated data (Figure 2A). In addition to verifying that predictions are accurate on average, it is also important to check that there are no regions of the genome showing systematically inaccurate ancestry predictions. 
Such regions could produce spurious signals of selection after admixture in scans of control individuals, or spurious admixture association signals in scans of disease cases [13]. Because such scans examine the tail of the observed distribution, even a single region where results are biased could be a serious confounder. With this in mind, we computed the average ancestry across all samples for each locus in the genome, as predicted by either HAPMIX or ANCESTRYMAP, and then searched for unusual deviations. HAPMIX estimates ranged between 16% and 22% European ancestry, and ANCESTRYMAP estimates ranged between 16% and 21%, with a mean of 19% for both methods. These small deviations from the mean are not statistically significant (nominal P-value = 0.001 for the most extreme value over hundreds of independent loci) and can be attributed to sampling variation in the individuals analyzed. We used HAPMIX to estimate the value of λ (the number of generations since admixture) that provided the best fit to the African American data set by computing likelihoods at different values of T (see Materials and Methods). We obtained an estimate of λ = 7, which matches the value of λ = 7.0 inferred by ANCESTRYMAP on the same data, and is similar to the value of λ = 6.3 previously inferred by ANCESTRYMAP on other African American data sets [4]. We also used inferred segments of African or European ancestry to estimate F ST values between the true ancestral populations of African Americans and the two reference populations used here (YRI and CEU, as well as African and European populations from the HGDP) (see Materials and Methods). We obtained estimates of 0.001 for the F ST between the true African ancestral population and YRI, and 0.001 for the F ST between the true European ancestral population and CEU. 
This is consistent with estimates of F ST = 0.001 derived from the τ parameter inferred by ANCESTRYMAP on the same data (F ST = 0.5/τ), and consistent with our previous findings that YRI and CEU provide accurate reference populations for admixture analysis of African Americans [4],[25]. Correspondingly, among the HGDP populations the lowest F ST to the true African ancestral population was obtained for the Yoruba population (F ST = 0.0008). The Bantu South African, Mandenka and Bantu Kenya groups had the next lowest values (F ST 0.035. This supports a West African origin for the African ancestry segments in African Americans, in agreement with historical records. For the European ancestral population, the lowest F ST was with French (F ST = 0.0013) with Italian, Orcadian, Tuscan, Russian, Basque and Adygei then showing increasing values, but F ST 0% and <100%, we also inferred the number of generations since mixture. Discussion We have described a method that takes advantage of haplotype information to accurately infer segments of chromosomal ancestry in admixed samples, even in the case of ancient admixture. The method is likely to be useful both for disease mapping in admixed populations and for drawing inferences about human history, as our empirical analyses of samples from African American and HGDP populations have demonstrated. The ability to reconstruct chromosomal segments from ancestral populations that contributed to recent or ancient admixture is a particular advance, as it implies that genetic analyses need not be restricted to extant populations but can also be applied to populations that have only left admixed descendents today [28]. By reconstructing allele frequencies and haplotypes from these populations, extensions of HAPMIX may be able to learn about population relationships as they existed at the time of the Neolithic agricultural migrations or even before. 
An open question is how far back in time HAPMIX will be able to probe the histories of anciently admixed populations. The simulations of Figure 3 suggest that HAPMIX has power in theory to produce informative estimates of local ancestry even for populations that admixed 400 generations ago (over 10,000 years). HAPMIX has particularly important applications for disease gene mapping, especially in African Americans, where the ancestry estimates are exceedingly accurate and where we have shown that they are not systematically biased. With the accurate estimates of ancestry that emerge from HAPMIX it should be possible to carry out dense case-control association studies with hundreds of thousands of markers, which simultaneously test for admixture association [1]–[3] and case-control association, providing more power to detect disease associations than can be obtained from either approach alone. While our analyses show that HAPMIX, because of its explicit use of a population genetic model, has better power to infer locus-specific ancestry than many recent methods, the method also has some limitations in the range of scenarios in which it can be used. For example, it is not currently designed for the analysis of mixtures of more than two ancestral populations, and it requires the use of reference populations. Future directions for extending the HAPMIX method include allowing more than two ancestral populations, using the admixed samples as a pool of reference haplotypes instead of relying on input haplotypes from reference populations, and automating the fitting of model parameters. In addition, although determining the number of generations since admixture with high accuracy is not necessary for effective inference of local ancestry, our results motivate additional work to enable detection of multiple admixture events at different points in time in order to refine the inferences that can be made about human history.
Supporting Information

Text S1. Supplementary note. (0.05 MB DOC)

Text S2. Appendix. (0.10 MB DOC)

            The landscape of recombination in African Americans

            Recombination, together with mutation, is the ultimate source of genetic variation in populations. We leverage the recent mixture of people of African and European ancestry in the Americas to build a genetic map measuring the probability of crossing-over at each position in the genome, based on about 2.1 million crossovers in 30,000 unrelated African Americans. At intervals of more than three megabases it is nearly identical to a map built in Europeans. At finer scales it differs significantly, and we identify about 2,500 recombination hotspots that are active in people of West African ancestry but nearly inactive in Europeans. The probability of a crossover at these hotspots is almost fully controlled by the alleles an individual carries at PRDM9 (P<10−245). We identify a 17 base pair DNA sequence motif that is enriched in these hotspots, and is an excellent match to the predicted binding target of African-enriched alleles of PRDM9.

              Reconstructing the Population Genetic History of the Caribbean

              Introduction Genomic characterization of diverse human populations is critical for enabling multi-ethnic genome-wide studies of complex traits [1]. Genome-wide data also affords reconstruction of population history at finer scales, shedding light on evolutionary processes shaping the genetic composition of peoples with complex demographic histories. This genetic reconstruction is especially relevant in recently admixed populations from the Americas. Native peoples throughout the American continent experienced a dramatic demographic change triggered by the arrival of Europeans and the subsequent African slave trade. Important progress has been made to characterize genome-wide patterns of these three continental-level ancestral components in admixed populations from the continental landmass [2] and other Hispanic/Latino populations [3], including recent genotyping and sequencing studies involving Puerto Rican samples [4], [5], [6]. However, no genomic survey has focused on multiple populations of Caribbean descent, and critical questions remain regarding their recent demographic history and fine-scale population structure. Several factors distinguish the Antilles and the broader Caribbean basin from the rest of North, Central, and South America, resulting in a unique territory with particular dynamics impacting each of its ancestral components. First, native pre-Columbian populations suffered dramatic population bottlenecks soon after contact. This poses a challenge for reconstructing population genetic history because extant admixed populations have retained a limited proportion of the native genetic lineages [7]. Second, it is widely documented that the initial encounter between Europeans and Native Americans, such as the first voyages of Columbus, took place in the Caribbean before involving mainland populations. 
However it remains unclear whether the earlier onset of admixture in the Caribbean translates into substantial differences in the European genetic component of present-day admixed Caribbean genomes, compared to other Hispanic/Latino populations impacted by later, and probably more numerous, waves of European migrants. Third, the Antilles and surrounding mainland of the Caribbean were the initial destination for much of the trans-Atlantic slave trade, resulting in admixed populations with higher levels of African ancestry compared to most inland populations across the continent. However, the sub-continental origins of African populations that contributed to present-day Caribbean genomes remain greatly under-characterized. Disentangling the origin and interplay among ancestral components during the process of admixture enhances our knowledge of Caribbean populations and populations of Caribbean descent, informing the design of next-generation medical genomic studies involving these groups. Here, we present SNP array data for 251 individuals of Caribbean descent sampled in South Florida using a parent-offspring trio design and 79 native Venezuelans sampled along the Caribbean coast. The family-based samples include individuals with grandparents of either Cuban, Haitian, Dominican, Puerto Rican, Colombian, or Honduran descent. The 79 native Venezuelan samples are of Yukpa, Warao, and Bari tribal affiliation. We construct a unique database which includes public and data access committee-controlled data on genomic variation from over 3,000 individuals including HapMap [8], 1000 Genomes [6], and POPRES [9] populations, and African [10] and Native American [11] SNP data from diverse sub-continental populations employed as reference panels. 
We apply admixture deconvolution methods and develop a novel ancestry-specific PCA method (ASPCA) to infer the sub-continental origin of haplotypes along the genome, yielding a finer-resolution picture of the ancestral components of present-day Caribbean and surrounding mainland populations. Additionally, by analyzing the tract length distribution of genomic segments attributable to distinct ancestries, we test demographic models of the recent population history of the Greater Antilles and mainland populations since the onset of inter-continental admixture.

Results

Population structure of the Caribbean

To characterize population structure across the Antilles and neighboring mainland populations, we combined our genotype data for the six Latino populations with continental population samples from western Africa, Europe, and the Americas, as well as additional admixed Latino populations (see Table S1). To maximize SNP density, we initially restricted our reference panels to representative subsets of populations with available Affymetrix SNP array data (Figure 1A). Using a common set of ∼390 K SNPs, we applied both principal component analysis (PCA) and an unsupervised clustering algorithm, ADMIXTURE [12], to explore patterns of population structure. Figure 1B shows the distribution in PCA space of each individual, recapitulating clustering patterns previously observed in Hispanic/Latino populations [3]: Mexicans cluster largely between European and Native American components, Colombians and Puerto Ricans show three-way admixture, and Dominicans principally cluster between the African and European components.
Ours is the first study to characterize genomic patterns of variation from (1) Hondurans, which we show have a higher proportion of African ancestry than Mexicans, (2) Cubans, which show extreme variation in ancestry proportions ranging from 2% to 78% West African ancestry, and (3) Haitians, which show the largest average proportion of West African ancestry (84%). Additional clustering patterns obtained from higher PCs are shown in Figure S1.

Figure 1 (10.1371/journal.pgen.1003925.g001). Population structure of Caribbean and neighboring populations. A) Areas in red indicate countries of origin of newly genotyped admixed population samples and blue circles indicate new Venezuelan (underlined) and other previously published Native American samples. B) Principal Component Analysis and C) ADMIXTURE [12] clustering analysis using the high-density dataset containing approximately 390 K autosomal SNP loci in common across admixed and reference panel populations. Unsupervised models assuming K = 3 and K = 8 ancestral clusters are shown. At K = 3, Caribbean admixed populations show extensive variation in continental ancestry proportions among and within groups. At K = 8, sub-continental components show differential proportions in recently admixed individuals. A Latino-specific European component accounts for the majority of the European ancestry among Caribbean Latinos and is exclusively shared with Iberian populations within Europe. Notably, this component is different from the two main gradients of ancestry differentiating southern from northern Europeans. Native Venezuelan components are present in higher proportions in admixed Colombians, Hondurans, and native Mayans.

We used the program ADMIXTURE to fit a model of admixture in which an individual's genome is composed of sites from up to K ancestral populations. We explored K = 2 through 15 ancestral populations (Figure S2) to investigate how assumptions regarding K impact the inference of population structure.
Assuming a K = 3 admixture model, population admixture patterns are driven by continental reference samples with no continental subdivision (Figure 1C, top panel). However, higher Ks show substantial substructure in all three continental components. Log likelihoods for successively increasing levels of K continue to increase substantially as K increases (Figure S3a), which is not unexpected since higher values of K add more parameters to the model (thereby improving the fit). Using cross-validation we found that K = 7 and K = 8 have the lowest predicted error (Figure S3b); thus, we focused on these two models. The first sub-continental components that emerge are represented by South American population isolates, namely the three Venezuelan tribes of Yukpa, Warao, and Bari. At higher-order Ks, we recapitulate the well-documented North-to-South American axis of clinal genetic variation described by us [13] and others [11], [14], as Mesoamerican (Maya/Nahua) and Andean (Quechua/Aymara) populations are assigned to different clusters (Figure S2). Interestingly, Mayans are the only group showing substantially higher contributions from the native Venezuelan components (Figure 1C, bottom panel). Both Mesoamerican and Andean Native American samples contain considerable amounts of European ancestry, due to post-Columbian admixture. Above K = 7, we observe a North-to-South European differentiation, which is consistent with previous analyses [15], [16]. Surprisingly, we observe another European-specific component emerge as early as K = 5 and remain constant through K = 15 (Figure S2). This component accounts for the majority of the Caribbean Latinos' European ancestry, and it only appears in Mediterranean populations, including Italy, Greece, Portugal, and Spain at intermediate proportions. 
Throughout this paper, we refer to this component as the “Latino European” component, and it can be seen clearly in Figure 1C (black bars represent the “Latino European” component, red bars the “Northern European” component, and pink bars the “Mediterranean” or “Southern European” component). At K = 8, when the clinal gradient of differentiation between Southern and Northern Europeans appears, the Latino European component is seen only in low proportions in individuals from Portugal and Spain, whereas it is the major European component among Latinos (Figure 1C, bottom panel).

To identify possible sex-biased gene flow in Caribbean populations, we compared the ancestry proportions of the X chromosome vs. the autosomes in each population. We observe a significant skew towards a higher proportion of Native American ancestry on the X chromosome than on the autosomes (p-value 0.05, Figure S4). Overall, we find evidence of a high Native American, and to a lesser extent African, female contribution in Caribbean populations. Additionally, our data show a strong signature of assortative mating based on genetic ancestry among Caribbean Latinos, as suggested by previous studies [17]. In particular, we see a strong correlation between maternal and paternal ancestry proportions (Figure S5). To assess significance, we compared correlation of ancestry assignments among parent pairs to 100,000 permuted male-female pairs for each continental ancestry. All p-values were highly significant (p < […]

[…] >3% global Native American ancestry together with the full reference panel of ancestral populations (Figure S7). ASPC1 separates the northernmost populations of the continent from the rest, while the Brazilian Surui and Central American Cabecar define the extremes of ASPC2.
Most Native American haplotypes from the admixed genomes fall along this second axis of variation, forming two overlapping population clusters: one represented primarily by Colombians and Hondurans, and the other by Cubans, Dominicans, and Puerto Ricans (no Haitian haplotypes were included due to low levels of Native American ancestry). Figure 4A shows a closer view, in which Colombians and most Hondurans cluster closer to Chibchan-speaking groups from Western Colombia and Central America, including the Kogi, Embera, and Waunana. In contrast, most Caribbean islanders cluster with Amazonian groups from Eastern Colombia, Brazil, and Guiana. The closest ancestral populations include the Guahibo, Piapoco, Ticuna, Palikur, and Karitiana, among others, some of which are settled along fluvial territories of the Orinoco-Rio Negro basin. This location may have facilitated communication from the rainforest to the coast, explaining the relationship with Caribbean native components.

Figure 4 (10.1371/journal.pgen.1003925.g004). Sub-continental origin of Native American components in the Caribbean. A) Ancestry-specific PCA analysis restricted to Native American segments from admixed Caribbean individuals (colored circles) and a reference panel of indigenous populations (gray symbols) from [11], grouped by sampling location. Darker symbols denote countries of origin with populations clustering closer to our Caribbean samples. Indigenous Colombian populations were classified into East and West of the Andes to ease the interpretation of their differential clustering in ASPCA. Population labels are shown for samples defining PC axes and representative clusters within locations. B) ADMIXTURE model for K = 16 ancestral clusters considering additional Latino samples, a representative subset of African and European source populations, and 52 Native American populations from [11], plus three additional Native Venezuelan tribes genotyped for this project. Vertical thin bars represent individuals and white spaces separate populations. Native American populations from [11] are grouped according to linguistic families reported therein. Labels are shown for the populations representing the 12 Native American clusters identified at K = 16. Clusters involving multiple populations are identified by those with the highest membership values. C) Map showing the major indigenous components shared across the Caribbean basin as revealed by ADMIXTURE at K = 16 from B), namely Mesoamerican (blue), Chibchan (yellow), and South American (green). Colored bars represent individuals and their approximate sampling locations. Bars pooling genetically similar individuals from more than one population are plotted from left to right following north to south coordinates as listed by population labels. Guarani, Wichi, and Chane from north Argentina are pooled with Arara but only the location of the latter is shown to allow us to provide a zoomed view of the Caribbean region (see [11] for the full map of sampling locations). The thick arrow represents schematically the most accepted origin of the Arawak expansion from South America into the Greater Antilles around 2,500 years ago according to linguistic and archaeological evidence [30]. Asterisks next to population labels denote Arawakan populations included in our reference panel. The thin arrow indicates gene flow between South America and Mesoamerica, possibly following a coastal or maritime route, accounting for the Mayan mixture and supporting pre-Columbian back migrations across the Caribbean.

Interestingly, the indigenous component of insular Caribbean samples seems to be shared across the different islands, suggesting gene flow across the Caribbean basin in pre-Columbian times. To explore this possibility in more detail, we performed a model-based clustering analysis using the full reference panel of 52 Native American populations from Reich et al.
[11] in addition to our three native Venezuelan populations. Individual admixture proportions from K = 2 through 20 are given in Figure S8. Focusing on Native American components, the first sub-continental signal (at K = 4) comprised a Chibchan component mainly represented by the Cabecar from Costa Rica and the Bari from Venezuela. Higher-order clusters pulled out Amazonian population isolates such as the Surui and Warao, as well as northern populations including the Eskimo-Aleut and Pima, in agreement with the outliers detected in our ASPCA analysis (Figure S7). Interestingly, from K = 5 through 10, the Chibchan component is shared at nearly 100% with the Yukpa sample located near the Venezuelan coast, and at nearly 20% with Mayans from the Yucatan peninsula and Guatemala (Figure S8). Higher-order clusters maintain the connection between Mayans and South American components. For example, at K = 16 (the model with the lowest cross-validation error; Figure S9b), an average of 35% of the genome in Mayans is shared with a mixed South American component mainly represented by the Ticuna, Piapoco, Guahibo, Arhuaco, Kogi, Embera, Palikur, and Wichi, among others (Figure 4B and C). The presence of considerable proportions of Central and South American components in the Mayan sample is indicative of possible “back” migrations from Central America and northern South America into the Yucatan peninsula, revealing active gene flow across the Caribbean, probably following a coastal or maritime route. This observation is in agreement with our ASPCA results from admixed genomes and reinforces the notion of an expansion of South American-based Native American components across the Caribbean basin. 
European ancestral components

We performed ASPCA analysis restricted to European segments of admixed individuals with >25% of European ancestry and a panel of European source populations, including 1,387 individuals from Europe sampled as part of the POPRES project [9], as well as additional Iberian samples from Galicia, Andalusia, and the Basque country in Spain [24]. The combined dataset included 2,882 European haplotypes and 255 haplotypes of European ancestry from the admixed populations. Figure 5 shows the first two PCs, where, as reported previously, the reference samples recapitulate a map of Europe [15], [25]. While most of the additional Iberian samples cluster together with the POPRES individuals sampled as Portuguese and Spanish, the Basques cluster separately from the centroid of most Iberian samples. The Basques are known for their historical and linguistic isolation, which could explain their genetic differentiation from the main cluster due to drift. Given the known Iberian origin of the first European settlers arriving into the Caribbean and surrounding territories of the New World, one would expect that European blocks derived from admixed Latino populations should cluster with other European haplotypes from present-day Iberians. Indeed, our Latino samples aggregate in a well-defined cluster that overlaps with the cluster of samples from the Iberian Peninsula (i.e., Portugal and Spain). However, we observed that the centroid is substantially deviated with respect to the Iberian cluster (bootstrap p-value < […]

[…] >25% European ancestry derived from insular Caribbean (black symbols) and mainland populations (gray symbols) combined with a reference panel (colored labels) of 1,387 POPRES European samples with four grandparents from the same country [15], and 54 additional Iberian individuals (in yellow) from [24].
PC1 values have been inverted and axes rotated 16 degrees counterclockwise to approximate the geographic orientation of population samples over Europe. Population codes are detailed in Table S1 and regions within Europe are labeled as in [16]. Inset map: countries of origin for POPRES samples color-coded by region (areas not sampled in gray and Switzerland in intermediate shade of green to denote shared membership with EUR W, EUR C, and EUR S). Most Latino-derived European haplotypes cluster around the Iberian cluster. One of the two Haitian individuals included in the analysis clustered with French speaking Europeans (black arrow), in agreement with the colonial history of Haiti and illustrating the fine-scale resolution of our ASPCA approach.

Importantly, when we applied ASPCA using the exact same reference panel of European samples but analyzing Mexican haplotypes of European ancestry (Moreno-Estrada, Gignoux et al., in preparation), we did not observe a deviated clustering pattern from the Iberian cluster: the effect is much weaker and not significant (bootstrap p-value = 0.099, see Figure S10). Furthermore, the deviation of the European segments of Mexican individuals from the distribution of the rest of Iberian samples is even smaller than the deviation of the Portuguese from the Spanish samples. We further evaluated whether the dispersion of the different subpopulations within the Caribbean cluster follows particular patterns along ASPC2, the axis driving the deviation from the Iberian centroid. We observed that Colombians and Hondurans tend to account for lower (more deviated) ASPC2 values compared to Cubans, Dominicans, and Puerto Ricans (Figure S11), suggesting a mainland versus insular population differentiation. We performed a Wilcoxon rank test to contrast ASPC2 for mainland (Colombia and Honduras) versus island (Cuba, Dominican Republic and Puerto Rico) populations, resulting in a highly significant p-value (1.5×10^-15).
Because >25% of European ancestry was required for inclusion in ASPCA, only two Haitian haplotypes were analyzed, and thus these were not included in the statistical analysis. Nonetheless, it is noteworthy that one of them clusters with the French, in agreement with historical and linguistic evidence regarding European settlements on the island (see arrow on Figure 5). Among European populations, Iberians also have the highest proportion of identical by descent (IBD) segments that are shared with Latino populations, as measured by a summed pairwise IBD statistic that is informative of the total amount of shared DNA between pairs of populations (see Materials and Methods and Figure S12). To explore the distribution of IBD sharing within continental groups, we considered Caribbean Latinos and Europeans separately by summing the cumulative amount of DNA shared IBD between each pair of individuals within each group. If European segments from Latino populations derive from a reduced number of European ancestors, then IBD sharing should be higher among Caribbean individuals compared to Europeans. Indeed, we observed a higher number of pairs sharing larger total IBD segment lengths among Latino individuals than among Europeans (Figure S13). Within-population cryptic relatedness is also compatible with increased IBD sharing. However, this is more likely to occur between individuals from the same subpopulation (e.g., COL-COL) rather than individuals from geographically separated subpopulations (e.g., COL-PUR). For this reason, we repeated the analysis, excluding within-population pairs of Latino individuals, and compared the IBD distribution to that of Iberian source populations (i.e., Spanish and Portuguese). Once again, we observed an increased proportion of IBD sharing among Latinos, arguing for a shared founder effect (Figure S13).
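The pairwise IBD comparison described above can be illustrated with a minimal Python sketch. It totals IBD segment lengths for each pair of individuals and can optionally drop within-population pairs, mirroring the repeated analysis that excludes pairs such as COL-COL. The segment list, individual identifiers, and the function name `total_ibd_per_pair` are hypothetical and are not taken from the study's pipeline.

```python
from collections import defaultdict

# Hypothetical IBD segments: (individual_a, individual_b, length_cM).
segments = [
    ("COL1", "COL2", 12.0),
    ("COL1", "PUR1", 4.5),
    ("COL1", "PUR1", 3.0),
    ("PUR1", "CUB1", 6.0),
    ("COL2", "CUB1", 2.5),
]
# Hypothetical population label for each individual.
population = {"COL1": "COL", "COL2": "COL", "PUR1": "PUR", "CUB1": "CUB"}


def total_ibd_per_pair(segments, population, exclude_within=False):
    """Sum IBD segment lengths for each unordered pair of individuals.

    If exclude_within is True, pairs drawn from the same population
    are skipped before summing.
    """
    totals = defaultdict(float)
    for a, b, length in segments:
        if exclude_within and population[a] == population[b]:
            continue
        totals[tuple(sorted((a, b)))] += length
    return dict(totals)


all_pairs = total_ibd_per_pair(segments, population)
between_only = total_ibd_per_pair(segments, population, exclude_within=True)
```

Comparing the distribution of the values in `between_only` for Latino pairs against the same quantity for Iberian pairs would correspond to the comparison summarized in Figure S13.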
These results are in agreement with our cluster-based analysis focused on global ancestry proportions, where the European ancestry of Latinos is dominated by a shared Latino-specific component differentiated from both southern and northern European components, although shared to some extent with Spanish and Portuguese (Figure 1C). Bottlenecked populations may exhibit differentiation from their parental gene pool due to loss of genetic diversity and stochastic shifts in allele frequencies. One way of quantifying the extent of genetic drift is to compare FST estimates among the K = 8 ancestral clusters from Figure 1C. In the absence of drift, we would expect the southern-derived Latino component and the southern European component to show a very low level of FST. However, we observe an FST = 0.021 (Table S3). To put this into perspective, the FST of southern vs. northern Europe is FST = 0.02, meaning that the differentiation of the Latino-specific component with respect to southern Europeans is at least as high as the north-south differentiation within Europe. This observation was replicated when including additional Latino and ancestral populations (Figure S8). Given the increased number of divergent clusters, we focused on K = 18 through 20, in which all sub-continental European components were jointly detected. In this case, the Latino-specific component shows further fragmentation into two components: one predominantly shared among insular Caribbean samples and the other among mainland Latinos. The FST value for southern versus northern European differentiation was 0.039, while values for southern versus insular (0.041) or mainland Latinos (0.04) were slightly inflated (Table S4), supporting the notion of additional differentiation impacting the European component of present-day admixed Latinos. 
African ancestral components

The Caribbean region has a complex history of population exchange with the African continent as a result of slave trade practices during European colonialism. Its proximity to the North Atlantic Ocean facilitated nautical contact with the West African coast, increasing the exposure of the local population to slave trade routes and ultimately resulting in genetic admixture between Caribbean and African individuals. We found the proportion of African ancestry to be higher in Caribbean populations compared to those from the mainland (Figure 1C), a finding that is consistent across studies [3], [6], [26]. To explore the sub-continental composition of African segments derived from Caribbean admixed genomes, we performed ASPCA analysis on individuals with more than 25% of African ancestry using a diverse panel of African populations as potential sources (see Table S1). Our first approximation showed no dispersion of Afro-Caribbean haplotypes over PCA space. Instead, they form a relatively tight cluster that overlaps with that of the Yoruba sample from southwestern Nigeria (Figure S14). This is a plausible result, given the extensive historical record supporting a West African origin for the African lineages in the Americas. However, according to our tract length analysis, there is strong genetic evidence for the occurrence of at least two pulses of African migrants imprinting different genomic signatures in present day admixed Caribbean populations. This result raises the question of whether both pulses involved the same source population during the admixture process. If this were the case, it would easily explain our ASPCA results, where all African haplotypes point to a single source.
Alternatively, if more than one source were involved and if enough mixing occurred since the two pulses, it is possible that what we see in ASPCA is the midpoint of the two source populations, causing the difference to remain undetected by our standard approach (which gives a point estimate averaging the signature of all African blocks along the genome). Hence, we applied a different strategy, in which ASPCA is performed separately for short (thus older) and long (younger) ancestry tracts. For this purpose, we split the African segments of each haploid genome into two categories based on a 50-cM length cutoff and intersected the data with a reference panel of West African populations (Figure 6A). Then, for each individual, we computed assignment probabilities of coming from each of the putative parental populations based on bivariate normal distributions fitted around each PCA cluster (see Materials and Methods, Figure S15). In Figure 6B we present the scaled mean probabilities for long (>50 cM) versus short (<50 cM) […]

[…] >50 cM in red) ancestry tracts. African ancestry tracts for Puerto Ricans are shown and results for all populations are available in Figure S16. C) Proportion of African ancestry of inferred Mandenka origin as a function of block size in the combined set of Caribbean genomes. By running PCAdmix within the previously inferred African segments, we obtained posterior probabilities for Mandenka versus Yoruba ancestry.

Overall, we found evidence for a differential origin of the African lineages in present day Afro-Caribbean genomes, with shorter (and thus older) ancestry tracts tracing back to Far West Africa (represented by Mandenka and Brong), and longer (and thus younger) tracts tracing back to Central West Africa. One caveat of this analysis is that short ancestry tracts are more likely to be misassigned. To rule this out as a source of the signal, we added an intermediate block size category (>5 cM and < […]

[…] >10%.
Four trios were not considered for trio phasing due to an excess of Mendelian errors (>100 K), two trios were removed due to 3rd or higher degree of relatedness between parents as inferred by IBD, and five trios were filtered due to cryptic relatedness between members of different trios above 10% IBD. After filtering, 65 complete trios remained for haplotype-based analyses. To study population structure and demographic patterns involving relevant ancestral populations, 79 previously collected samples from three native Venezuelan tribes were genotyped using the same array (i.e., 25 Yukpa [aka Yucpa], 29 Bari, and 25 Warao). We combined our data with publicly available genomic resources and assembled a global database incorporating genome-wide SNP array data for 3,042 individuals from which two datasets with different SNP densities were constructed (see Table S1). The high-density dataset included populations with available SNP data from Affymetrix arrays; namely African, European, and Mexican HapMap samples [8], Europeans from POPRES [9], West Africans from Bryc et al. [10], and Native Americans from Mao et al. [35]. After merging and quality control filtering, 389,225 SNPs remained and representative population subsets were used in different analyses as detailed through sections below. Our lower density dataset (30,860 SNPs) resulted from the intersection of our high-density dataset with available SNP data generated on Illumina platform arrays, including 52 additional Native American populations [11], as well as additional Latino populations sampled in New York City [7] and 1000 Genomes Latino samples [6]. The resulting dataset combines genomic data for 1,262 individuals from 80 populations. Full details on the population samples are available in Table S1. 
Population structure

An unsupervised clustering algorithm, ADMIXTURE [12], was run on our high-density dataset to explore global patterns of population structure among a representative subset of 641 samples, including seven Native American, eleven POPRES European, HapMap3 Nigerian Yoruba, HapMap3 Mexican, and our six new Caribbean Latino populations (see Table S1). Fourteen ancestral clusters (K = 2 through 15) were successively tested. Log likelihoods and cross-validation errors for each K clusters are available in Figure S3. FST based on allele frequencies was calculated in ADMIXTURE v1.22 for each identified cluster at K = 8 and values are available in Table S3. Our low-density dataset comprising 1,262 samples (detailed in Table S1) was used to run K = 2 through 20. Log likelihoods, cross validation errors and FST values from ADMIXTURE are available in Figure S9 and Table S4. Principal component analysis (PCA) was applied to both datasets using EIGENSOFT 4.2 [36] and plots were generated using R 2.15.1. Sex bias in ancestry contributions was evaluated by selecting only females (to ensure we compare a diploid X chromosome to diploid autosomes), and running ADMIXTURE at K = 3 on the X chromosome and autosomes separately. The Wilcoxon signed rank test, a non-parametric version of the paired Student's t-test that does not require the normality assumption, was applied to assess the significance of the difference in X and autosomal ancestry proportions. This tests whether the average difference of ancestry proportions assigned to a given source population for the X and for the autosomes of each sample is significantly different from zero. The test was applied to the entire collection of Latino samples, revealing an over-arching trend, and then to each population in turn to identify any between-population differences.
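As a rough illustration of the paired test described above, the following Python sketch implements a two-sided Wilcoxon signed-rank test using the large-sample normal approximation. It is not the authors' R code: the function name `wilcoxon_signed_rank`, the eight paired values, and the choice to omit tie and continuity corrections are all assumptions made for brevity, and statistical packages may instead report exact p-values for small samples.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Paired Wilcoxon signed-rank test (two-sided, normal approximation).

    Zero differences are dropped; tied absolute differences receive
    average ranks. No tie or continuity correction is applied.
    Returns (w_plus, w_minus, p_value).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (min(w_plus, w_minus) - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail area
    return w_plus, w_minus, p_value


# Hypothetical Native American ancestry (in percent) for eight females,
# estimated on the X chromosome and on the autosomes.
x_chrom = [20, 15, 30, 12, 25, 18, 22, 10]
autosomes = [12, 10, 21, 11, 15, 16, 18, 13]
w_plus, w_minus, p_value = wilcoxon_signed_rank(x_chrom, autosomes)
```

Here `w_plus` and `w_minus` are the rank sums of positive and negative differences, so comparing them indicates which of the two proportions tends to be larger.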
A rejection of the null hypothesis means that the ancestry proportions on the X and the autosomes are significantly different from one another but does not imply which proportion is larger. We provide box plots as a visual aid to show the direction of the difference (Figure S4). Global ancestry estimates from ADMIXTURE at K = 3 were used to test the correlation between male and female ancestry proportions considering all trio founders within each Caribbean population as well as within the full set of admixed trios. Linear models and permutations (up to 100,000) were performed using R 2.15.1.

Phasing and local ancestry assignment

Family trio genotypes from our six Caribbean populations and continental reference samples were phased using BEAGLE 3.0 software [37]. Local ancestry assignment was performed using PCAdmix (http://sites.google.com/site/pcadmix/ [19]) at K = 3 ancestral groups. This approach relies on phased data from reference panels and the admixed individuals. To maintain SNP density and maximize phasing accuracy we restricted to a subset of reference samples with available Affymetrix 6.0 trio data, namely 10 YRI, 10 CEU HapMap3 trios, and 10 Native American trios from Mexico [5]. Each chromosome is analyzed independently, and local ancestry assignment is based on loadings from Principal Components Analysis of the three putative ancestral population panels. The scores from the first two PCs were calculated in windows of 70 SNPs for each panel individual (in previous work we have estimated a suitable number of 10,000 windows to break the genome into when inferring local ancestry using PCAdmix; in this case, after merging Affymetrix 6.0 data from admixed and reference panels, a total of 743,735 SNPs remained, and dividing by 10,000 gives a window length of ∼70 SNPs). For each window, the distribution of individual scores within a population is modeled by fitting a multivariate normal distribution.
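The window model can be sketched in a few lines of Python. The example below first reproduces the window-length arithmetic from the text, then fits a bivariate normal to the first two PC scores of each reference panel within a single window and evaluates the log density of a query haplotype under each fit. This is only an illustration under stated assumptions: the panel scores, the query point, and the helper names `fit_bivariate_normal` and `log_density` are hypothetical, and the code does not reproduce PCAdmix internals.

```python
import math

# Window-length arithmetic from the text: 743,735 SNPs over 10,000 windows.
snps_per_window = round(743_735 / 10_000)  # about 74, reported as ~70


def fit_bivariate_normal(points):
    """Return (mean, covariance) for 2-D points using the sample covariance.

    The covariance is returned as the tuple (sxx, sxy, syy).
    """
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def log_density(point, mean, cov):
    """Log density of a bivariate normal at `point`."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # Quadratic form d^T Sigma^{-1} d for a 2x2 covariance matrix.
    quad = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return -math.log(2 * math.pi) - 0.5 * math.log(det) - 0.5 * quad


# Hypothetical (PC1, PC2) scores of panel haplotypes in one window.
panel_a = [(0.0, 0.0), (1.0, 0.2), (0.2, 1.0), (1.1, 1.1)]
panel_b = [(5.0, 5.0), (6.0, 5.3), (5.2, 6.1), (6.1, 6.0)]
mean_a, cov_a = fit_bivariate_normal(panel_a)
mean_b, cov_b = fit_bivariate_normal(panel_b)

query = (0.5, 0.5)  # hypothetical score of an admixed haplotype
ll_a = log_density(query, mean_a, cov_a)
ll_b = log_density(query, mean_b, cov_b)
```

In this toy window the query score lies close to the first panel, so `ll_a` exceeds `ll_b`.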
Given an admixed chromosome, these distributions are used to compute likelihoods of belonging to each panel. These scores are then analyzed in a Hidden Markov Model with transition probabilities as in Bryc et al. [10]. The g (generations) parameter in the HMM transition model was determined iteratively so as to maximize the total likelihood of each analyzed population. Local ancestry assignments were determined using a 0.9 posterior probability threshold for each window using the forward-backward algorithm. In analyses that required estimating the length of continuous ancestry tracts, the Viterbi algorithm was used. An assessment of the accuracy of this approach is given in [5].

Tract length analysis

We used the software Tracts [20] to identify the migratory model that best explains the genome-wide distribution of ancestry patterns. Specifically, we considered three migration models, each featuring a panmictic population absorbing migrants from three source populations. The models differ by the number of allowed migration events per population. In the simplest model, the population is founded by Native American and European individuals, and later receives a pulse of African migrants. The initial ancestry proportion and timing, as well as the African migration amplitude and timing, are fitted to the data as described below. The other two models feature an additional input of either European or African migrants; the timing and magnitude of this additional pulse result in two additional parameters that must be fitted to the data. Here, the data consisted of Viterbi calls from PCAdmix (see previous section and Figure 2), that is, the most probable assignment of local ancestry along the genomes. To fit parameters to these data, we tallied the inferred continuous ancestry tracts according to inferred ancestry and tract length using 50 equally spaced length bins per population, and one additional bin to account for full chromosomes.
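The tallying step can be illustrated with a short Python sketch that collapses per-window ancestry calls into continuous tracts and counts them in 50 equal-width length bins plus one extra bin for tracts spanning a whole chromosome. Several details are assumptions made for illustration and are not taken from the Tracts software: every window is given the same genetic length, the bins span zero to the chromosome length, and the function names `extract_tracts` and `bin_tracts` and the example calls are hypothetical.

```python
from itertools import groupby


def extract_tracts(calls, window_cm):
    """Collapse per-window ancestry calls into (ancestry, length_cM) tracts.

    `calls` holds one ancestry label per window; `window_cm` is the
    assumed constant genetic length of a window.
    """
    return [(anc, len(list(run)) * window_cm) for anc, run in groupby(calls)]


def bin_tracts(tracts, chrom_cm, n_bins=50):
    """Count tracts per ancestry in n_bins equal-width length bins.

    The final slot (index n_bins) is reserved for tracts that cover
    the whole chromosome.
    """
    counts = {}
    width = chrom_cm / n_bins
    for anc, length in tracts:
        hist = counts.setdefault(anc, [0] * (n_bins + 1))
        if length >= chrom_cm:
            hist[n_bins] += 1
        else:
            hist[min(int(length / width), n_bins - 1)] += 1
    return counts


# Two hypothetical haplotypes of a 100 cM chromosome, ten windows each.
hap1 = ["EUR"] * 3 + ["AFR"] * 2 + ["EUR"] * 5
hap2 = ["NAT"] * 10
tracts = extract_tracts(hap1, 10.0) + extract_tracts(hap2, 10.0)
counts = bin_tracts(tracts, chrom_cm=100.0)
```

The resulting per-ancestry histograms are the kind of binned counts that a likelihood-based fit of migration models would take as input.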
Given a migration model and parameters, Tracts calculates the expected counts per bin. Assuming that counts in each bin are Poisson distributed, it produces a likelihood estimate that is used to fit model parameters. For each population, we report the model with the best Bayesian Information Criterion (BIC), −2 log(L) + k log(n), with n = 153. Because we imposed a fixed number of migration pulses, we must keep in mind that migrations are likely to have been more continuous than what is displayed in the best-fitting models. One way to interpret the pulses is as time points that the migrations probably spanned. Resolving the duration of each pulse would likely require refined models and a great deal more data.

Ancestry-Specific Principal Component Analysis (ASPCA)

To explore within-continent population structure, we applied the following approach for each of the continental ancestries (i.e., Native American, European, and African) of admixed genomes. The general framework is shown in Figure 2. It comprises locus-specific continental ancestry estimation along the genome, followed by PCA analysis restricted to ancestry-specific portions of the genome combined with sub-continental reference panels of ancestral populations. For this purpose, we used our continental-level local ancestry estimates provided by PCAdmix to partition each genome into ancestral haplotype segments, and retained for subsequent analyses only those haplotypes assigned to the continental ancestry of interest. This is achieved by masking (i.e., setting to missing) all segments from the other two continental ancestries. Because ancestry-specific segments may cover different loci from one individual to another, a large amount of missing data results from scaling this approach to a population level, which limits the resolution of PCA. To overcome this problem, we adapted the subspace PCA (ssPCA) algorithm introduced by Raiko et al.
[38] to implement a novel ancestry-specific PCA (ASPCA) that allows accommodating phased haploid genomes with large amounts of missing data. Our method is analogous to the ssPCA implementation by Johnson et al. [23], which operates on genotype data. In contrast, ASPCA operates on haplotypes, allowing us to use much more of the genome (rather than just the parts estimated to have two copies of a certain ancestry) and to independently analyze the two haploid genomes of each individual. Finally, ancestry-specific haplotypes derived from admixed individuals are combined with haplotypes derived from putative parental populations and projected together onto PCA space. Details of the ASPCA algorithm and constructed datasets are described in Text S1.

Differentiation of sub-European ancestry components

To measure the observed deviation in ASPCA of European haplotypes derived from admixed Caribbean populations with respect to the cluster of Iberian samples, a bootstrap resampling-based test was performed. The null distribution was generated from comparing bootstraps of Portuguese and Spanish ASPCA values as models of the intrinsic Iberian population structure. We then compared the ASPCA values of the admixed individuals and tested if the observed differences between Iberian ASPCA values and those of the admixed individuals are more extreme than the differences within Iberia. The distance was determined using the chi-squared statistic of Fisher's method combining ASPC1 and ASPC2 t-tests for each bootstrap. We ran 10,000 bootstraps to determine one-tailed p-values. As Iberians we considered: POPRES Spanish, POPRES Portuguese, Andalusians, and Galicians; and as Caribbean Latinos: CUB, PUR, DOM, COL, and HON. Additional tests were performed comparing Portuguese versus the rest of Iberians and between an independent dataset of Mexican individuals analyzed by Moreno-Estrada, Gignoux et al.
(in preparation) projected onto ASPCA space using the same reference panel of European populations. A bivariate test was performed to measure the relative deviation from the Iberian cluster of the distribution given by the Caribbean versus the Mexican dataset. To determine whether insular versus mainland Caribbean populations disperse over significantly different ranges in ASPC2, a Wilcoxon rank test was performed between (COL+HON) versus (CUB, PUR, DOM). Haitians were excluded due to low sample size (N = 2 haplotypes). Boxplot is available in Figure S11. Population differentiation estimates between clusters inferred with ADMIXTURE were visualized and compared across runs where both the Latino-specific and southern European components were detected. Values are available in Table S3 and Table S4. To provide independent evidence on the sub-continental ancestry of European haplotypes, we considered segments that are identical by descent (IBD) between unrelated Latino individuals and a representative subset of European populations. We used our high-density dataset to extract a subset of 203 POPRES European individuals and the founders of the 65 complete admixed trios. We first performed a genome-wide pairwise IBS estimation using PLINK [39] to ensure that the dataset contains no samples with more than 10% IBS with any other sample. Then we used fastIBD [37] to phase the data and estimate segments shared IBD longer than 2 Mb to eliminate false positive IBD matches and assuming that ancestry will be shared among pairwise IBD hits of segments this long. All 2 Mb or greater segments shared IBD between pairs of individuals were summed, and histograms were created for pairwise matches within each group (i.e., POPRES Europeans, Iberians, and Caribbean Latinos). 
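The per-pair summation of long IBD segments can be sketched as follows. The input layout and function name are hypothetical (this is not fastIBD output parsing), and the sketch uses an inclusive 2 Mb cutoff, following the "2 Mb or greater" wording above.

```python
from collections import defaultdict

MIN_SEGMENT_BP = 2_000_000  # 2 Mb threshold from the text


def total_ibd_per_pair(segments, min_len=MIN_SEGMENT_BP):
    """Sum the lengths of IBD segments of at least `min_len` for each pair.

    segments -- iterable of (individual_a, individual_b, start_bp, end_bp)
    Returns {frozenset({a, b}): total_bp}; pairs are unordered.
    """
    totals = defaultdict(int)
    for ind_a, ind_b, start, end in segments:
        length = end - start
        if length >= min_len:
            totals[frozenset((ind_a, ind_b))] += length
    return dict(totals)


pair_totals = total_ibd_per_pair([
    ("i1", "i2", 0, 3_000_000),
    ("i2", "i1", 10_000_000, 12_500_000),  # same pair, reversed order
    ("i1", "i3", 0, 1_000_000),            # below threshold, dropped
])
```

The resulting per-pair totals are the values that would be histogrammed within each group.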
To quantify the proportion of shared DNA between pairs of populations we calculated a summed pairwise IBD statistic, which is the sum of lengths of all segments inferred to be shared IBD between a given European population and each Latino population, normalized by sample size.

Size-based ASPCA analyses

Given the evidence from our tract length analysis for a second pulse of African migrants into the admixture of insular Caribbean Latinos, a modified size-based ASPCA analysis was performed. A reference panel was built integrating three different resources [8], [10], [40] and focusing on putative source populations from along the West African coast, including Mandenka from Senegal, Yoruba and Igbo from Nigeria, Bamoun and Fang from Cameroon, Brong from western Ghana, and Kongo from the Democratic Republic of the Congo. We begin with the continental local ancestry inference from PCAdmix K = 3. For each individual we then divide African ancestry tracts into small (0 to 50 cM) and large (>50 cM) size classes. Given a partition of African ancestry tracts, we take all sites included in one tract class, say short tracts, and run PCA on our sub-continental West African reference populations for only these sites. Using the first two PCs from this analysis, we fit a bivariate normal distribution to each reference population cluster. We then project our test sample into this PCA space, and estimate the probability of it coming from each reference population using the fitted distributions. This procedure is repeated for each tract class, for each individual. For each admixed Caribbean population, we can then estimate the probability that a given class of African ancestry tracts comes from a specific West African source population as the average probability of assignment to this population across all individuals.
Finally, under the assumption that a given class of African tracts must come from one of the provided reference populations, we rescale these probabilities to sum to one. Each assignment estimate is also provided with error bars representing the standard error of the mean. We compare the short and long assignment probabilities for each Caribbean population to identify distinct source populations for “older” and “younger” West African migrations. Haitians were not included in the analysis due to low sample size (n = 4). Due to concerns that shorter tracts have a higher likelihood of mis-assignment, we added a medium tract size class (5 cM to 50 cM) to see if the results were simply due to very short (0 cM to 5 cM) European or Native American tracts being mis-classified as African. We compare the results for short and medium tracts and find that the trends are maintained, suggesting that the observation that older, shorter tracts appear to derive primarily from the Mandenka and Brong source populations is not simply due to short-tract mis-assignment.

Local ancestry estimation within African tracts

To identify likely regions of Yoruba versus Mandenka ancestry in the African component, we modified our implementation of PCAdmix to perform local ancestry deconvolution solely of the African segments of the admixed genomes. The modification is achieved in the final step of the algorithm: whereas the standard approach estimates a single HMM across an entire chromosome, here we fit J disjoint HMMs spanning each of the J blocks of African ancestry in a given chromosome for a given individual. Applying the method, we obtained posterior probabilities for Mandenka versus Yoruba ancestry within the previously inferred African segments. We then selected only those sub-regions that were confidently called as Mandenka or Yoruba, and stratified them by physical size.
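The size-based assignment summary (bivariate normal density, averaging across individuals, rescaling to sum to one, and standard errors) can be sketched as below. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the source does not specify how per-individual probabilities are derived from the fitted densities, so they are taken here as given inputs, and applying the same rescaling factor to the standard errors is an assumption.

```python
import math
from statistics import mean, stdev


def bivariate_normal_pdf(point, mu, cov):
    """Density of a 2-D normal at `point`, given mean `mu` and 2x2 covariance `cov`."""
    (sxx, sxy), (_, syy) = cov
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mu[0], point[1] - mu[1]
    # Squared Mahalanobis distance via the closed-form 2x2 inverse
    maha = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.exp(-0.5 * maha) / (2 * math.pi * math.sqrt(det))


def population_assignment(per_individual):
    """Average per-individual probabilities, then rescale to sum to one.

    per_individual -- list of {source_population: probability}, one per individual
    Returns (scaled_mean, scaled_sem), both keyed by source population.
    """
    pops = list(per_individual[0])
    n = len(per_individual)
    avg = {p: mean(d[p] for d in per_individual) for p in pops}
    sem = {p: stdev([d[p] for d in per_individual]) / math.sqrt(n) for p in pops}
    total = sum(avg.values())
    # Assumption: error bars are rescaled by the same factor as the means
    return ({p: avg[p] / total for p in pops},
            {p: sem[p] / total for p in pops})


scaled, errors = population_assignment([
    {"Mandenka": 0.6, "Yoruba": 0.2},
    {"Mandenka": 0.2, "Yoruba": 0.2},
])
```

Running the summary once per tract class (short, medium, long) gives the values that would be compared across classes for each admixed population.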
Supporting Information

Figure S1 Principal component 1 versus lower order PCs defining sub-continental components among Native American populations. Top: PC5 separates Venezuelan population isolates from the rest of Native Americans. Bottom: PC7 separates Mesoamerican from Andean groups. Mexicans and Hondurans distribute between the European and Mesoamerican clusters, whereas Colombians slightly deviate towards the Andean and Venezuelan clusters. Global PCA analysis based on the high-density dataset (∼390 K SNPs) and thus limited to reference panel populations with available Affymetrix SNP array data (see Table S1 for details). (TIF)

Figure S2 ADMIXTURE results from K = 2 through 15 based on the high-density dataset (∼390 K SNPs) including 7 admixed Latino populations and 19 reference populations. A low-frequency Southern European component restricted to Mediterranean populations at lower order Ks and specifically to Iberian populations at higher order Ks, accounts for the majority of European ancestry among Latinos (black bars). It further decomposes into population-specific clusters (purple bars) denoting higher similarities within the European portion among Latinos compared to European source populations. (TIF)

Figure S3 ADMIXTURE metrics at increasing K values based on Log-likelihoods (A) and cross-validation errors (B) for results shown in Figure S2. (TIF)

Figure S4 Comparison of ADMIXTURE estimates obtained from autosomes and the X chromosome in different Latino/Caribbean populations. A) Cluster-based results for K = 3 using the same set of ancestral populations as in Figure S2. Because the X chromosome is diploid only in females, the analysis was restricted to female individuals from the seven admixed Latino populations. Within each population, individuals are sorted from largest to smallest proportion of European ancestry.
B) Box plot showing the directionality of the difference between X and autosomal ancestry proportions considering all populations together. P-values on top correspond to the Wilcoxon signed rank test applied to assess statistical significance (see Materials and Methods). C) Box plots and statistical tests for each population (Haitians excluded due to low sample size). The observed pattern strongly supports the presence of sex-biased gene flow during the process of admixture throughout the Caribbean, with significantly higher contribution from Native American, and to a lesser extent West African, ancestors into the composition of the X chromosome, which largely reflects the female demographic history of a population. (TIF)

Figure S5 Correlation between male and female continental ancestries. Parents' ancestry proportions from each trio were used to compare correlation coefficients between the observed values and 100,000 permuted male-female pairs (p-values shown for the combined set of Latino Caribbean samples and for each population in Table S2). (TIF)

Figure S6 Ancestry tract lengths distribution per population and demographic model tested in Tracts. For each demographic scenario, the observed distribution is compared to the predictions of the best-fitting migration model (displayed below each distribution). Solid lines represent model predictions and shaded areas are one-sigma confidence region surrounding the predictions. Three different demographic scenarios were considered, all of which assume the involvement of European and Native American tracts at the onset of admixture, followed by the introduction of African migrants (denoted by EUR,NAT+AFR). The second and third models allow for an additional pulse of European (EUR,NAT+AFR+EUR) and African (EUR,NAT+AFR+AFR) ancestry, respectively. Likelihood values for each model are shown on top of each plot.
Pie charts above each migration model are proportional to the estimated number of migrants being introduced at each point in time (black arrows). GA: generations ago. (TIF)

Figure S7 ASPCA analysis of Native American haplotypes derived from admixed genomes (solid circles) and reference panel populations from [11] grouped by linguistic families as reported therein. Top panels: ASPCA with the full reference panel of Native American populations. Bottom panels: Filtered ASPCA without extreme outliers (Aleutians, Greenlanders, and Surui excluded from the analysis). Each individual from the reference panel is represented by the corresponding population label centered on its PCA coordinates. A zoomed version of PC1 vs. PC2 for the filtered set (bottom left) grouped by geographic sampling location is available in Figure 4A. (TIF)

Figure S8 ADMIXTURE results from K = 2 through 20 based on the low-density dataset (∼30 K SNPs) including additional admixed Latino and Native American reference populations (see Table S1 for details). The presence of the Latino European component (black and gray bars) is recaptured among independently sampled Latino populations. FL: Florida (this study); NY: New York; 1KG: 1000 Genomes Project samples. Native American populations from [11] are grouped according to linguistic families reported therein. Labels are shown for the populations representing the 15 Native American clusters identified at K = 20 (four of the remaining five being of European ancestry and one of West African ancestry). Clusters involving multiple populations are identified by those with the highest membership values. Throughout lower and higher order Ks, several South American components (yellow and green bars), show varying degrees of shared genetic membership with Mesoamerican Mayans, accounting for up to nearly half of their genome composition (see Figure 4 for more details).
(TIF)

Figure S9 ADMIXTURE metrics at increasing K values based on Log-likelihoods (A) and cross-validation errors (B) for results shown in Figure S8. (TIF)

Figure S10 ASPCA distribution of Iberian samples (red circles) compared to European haplotypes derived from our Latino Caribbean samples (top panel) and from an independent cohort of Mexican samples (bottom panel). The relative deviation from the Iberian cluster is significantly different comparing the Caribbean versus the Mexican dataset (see the main text for details). (TIF)

Figure S11 ASPC2 values per population from the European-specific PCA analysis shown in Figure 5 and Figure S10. Population codes as in Table S1. The boxplot shows that low ASPC2 values are enriched with mainland Colombian and Honduran haplotypes, whereas insular Caribbean populations show less deviated values from the Iberian cluster. A Wilcoxon rank test between mainland (COL, HON) versus insular samples (CUB, PUR, DOM) demonstrated that these two groups disperse over significantly different ranges in ASPC2 (Haitians excluded due to low sample size). (TIF)

Figure S12 IBD sharing between different Caribbean Latino populations and a representative subset of POPRES European populations as measured by a summed pairwise IBD statistic. For each Latino population, maximum pairwise IBD levels were observed in those pairs involving Spanish and, to a lesser extent, Portuguese samples, in agreement with our ASPCA results. (TIF)

Figure S13 IBD sharing between pairs of individuals within A) Caribbean Latinos and B) a representative subset of POPRES European populations. Inset histograms display counts lower than 50 for the same binning categories.
The overall count of pairs sharing short segments of total IBD is higher among Europeans, probably as a result of an older shared pool of source haplotypes. In contrast, the higher frequency of longer IBD matches among Latinos is compatible with a recent European founder effect. After excluding within-population pairs of Latino individuals (top right), there are still more and longer IBD matches among Caribbean populations compared to Iberians (bottom right). (TIF)

Figure S14 ASPCA analysis of African haplotypes derived from admixed genomes with >25% of African ancestry (black symbols) and a representative subset of African HapMap3 and other West African reference panel populations from [10]. Colombians and Hondurans excluded due to lower overall proportions of African ancestry. (TIF)

Figure S15 ASPCA analysis of short versus long African ancestry tracts from admixed genomes and West African reference panel populations. To exemplify our size-based ASPCA approach, the African genome of a Puerto Rican individual is displayed (denoted by PUR). Left: PUR clusters with Mandenka when only sites within short ancestry tracts ( 50 cM). (TIF)

Figure S16 African ancestry size-based ASPCA results per population sample. Considering three different classes of ancestry tract lengths (black: short; red: long; blue: intermediate), scaled assignment probabilities are shown for each African source population. Values on the y-axis are the average probability of assignment to each potential source population across all individuals within each Latino population (see Materials and Methods for details). (TIF)

Table S1 Summary of Latino populations and assembled reference panels. (PDF)

Table S2 Correlation p-values of male vs. female ancestry. (PDF)
Table S3 FST divergences between estimated populations for K = 8 using ADMIXTURE. (PDF)

Table S4 FST divergences between estimated populations for K = 20 using ADMIXTURE. (PDF)

Text S1 Methodology of the Ancestry-Specific PCA (ASPCA) implementation. (PDF)
                Author and article information

                Journal
                G3 (Bethesda)
                Genetics
                G3: Genes, Genomes, Genetics
                Genetics Society of America
                2160-1836
                3 November 2014
                December 2014
                Volume: 4
                Issue: 12
                Pages: 2505-2518
                Affiliations
                [* ]Department of Computer Science, University of California, Los Angeles, California 90095
                [† ]Department of Ecology and Evolutionary Biology, University of California, Los Angeles, California 90095
                [‡ ]Interdepartmental Program in Bioinformatics, University of California, Los Angeles, California 90095
                [§ ]Department of Human Genetics, University of California, Los Angeles, California 90095
                [** ]Department of Human Genetics, University of Chicago, Chicago, Illinois 60637
                [†† ]Department of Pathology and Laboratory Medicine, Geffen School of Medicine at University of California, Los Angeles, California 90095
                Author notes
                [1 ]Corresponding author: Department of Pathology & Laboratory Medicine, Geffen School of Medicine at University of California, Los Angeles, 10833 Le Conte Ave, CHS 33-365, Los Angeles, CA 90095. E-mail: bpasaniuc@mednet.ucla.edu
                Article
                GGG_014274
                DOI: 10.1534/g3.114.014274
                PMCID: PMC4267945
                PMID: 25371484
                Copyright © 2014 Yang et al.

                This is an open-access article distributed under the terms of the Creative Commons Attribution Unported License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

                Page count
                Pages: 14
                Product
                Categories
                Investigations
                Custom metadata
                v1

                Genetics

                genetic variation, genetic continuum, admixture, localization, ancestry inference