110
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      Reconstructing Native American Migrations from Whole-Genome and Whole-Exome Data

      research-article

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          There is great scientific and popular interest in understanding the genetic history of populations in the Americas. We wish to understand when different regions of the continent were inhabited, where settlers came from, and how current inhabitants relate genetically to earlier populations. Recent studies unraveled parts of the genetic history of the continent using genotyping arrays and uniparental markers. The 1000 Genomes Project provides a unique opportunity for improving our understanding of population genetic history by providing over a hundred sequenced low coverage genomes and exomes from Colombian (CLM), Mexican-American (MXL), and Puerto Rican (PUR) populations. Here, we explore the genomic contributions of African, European, and especially Native American ancestry to these populations. Estimated Native American ancestry is in MXL, in CLM, and in PUR. Native American ancestry in PUR is most closely related to populations surrounding the Orinoco River basin, confirming the Southern America ancestry of the Taíno people of the Caribbean. We present new methods to estimate the allele frequencies in the Native American fraction of the populations, and model their distribution using a demographic model for three ancestral Native American populations. These ancestral populations likely split in close succession: the most likely scenario, based on a peopling of the Americas thousand years ago (kya), supports that the MXL Ancestors split kya, with a subsequent split of the ancestors to CLM and PUR kya. The model also features effective populations of in Mexico, in Colombia, and in Puerto Rico. Modeling Identity-by-descent (IBD) and ancestry tract length, we show that post-contact populations also differ markedly in their effective sizes and migration patterns, with Puerto Rico showing the smallest effective size and the earlier migration from Europe. Finally, we compare IBD and ancestry assignments to find evidence for relatedness among European founders to the three populations.

          Author Summary

          Populations of the Americas have a rich and heterogeneous genetic and cultural heritage that draws from a diversity of pre-Columbian Native American, European, and African populations. Characterizing this diversity facilitates the development of medical genetics research in diverse populations and the transfer of medical knowledge across populations. It also represents an opportunity to better understand the peopling of the Americas, from the crossing of Beringia to the post-Columbian era. Here, we take advantage sequencing of individuals of Colombian (CLM), Mexican (MXL), and Puerto Rican (PUR) origin by the 1000 Genomes project to improve our demographic models for the peopling of the Americas. The divergence among African, European, and Native American ancestors to these populations enables us to infer the continent of origin at each locus in the sampled genomes. The resulting patterns of ancestry suggest complex post-Columbian migration histories, starting later in CLM than in MXL and PUR. Whereas European ancestral segments show evidence of relatedness, a demographic model of synonymous variation suggests that the Native American Ancestors to MXL, PUR, and CLM panels split within a few hundred years over 12 thousand years ago. Together with early archeological sites in South America, these results support rapid divergence during the initial peopling of the Americas.

          Related collections

          Most cited references64

          • Record: found
          • Abstract: found
          • Article: not found

          Fast model-based estimation of ancestry in unrelated individuals.

          Population stratification has long been recognized as a confounding factor in genetic association studies. Estimated ancestries, derived from multi-locus genotype data, can be used to perform a statistical correction for population stratification. One popular technique for estimation of ancestry is the model-based approach embodied by the widely applied program structure. Another approach, implemented in the program EIGENSTRAT, relies on Principal Component Analysis rather than model-based estimation and does not directly deliver admixture fractions. EIGENSTRAT has gained in popularity in part owing to its remarkable speed in comparison to structure. We present a new algorithm and a program, ADMIXTURE, for model-based estimation of ancestry in unrelated individuals. ADMIXTURE adopts the likelihood model embedded in structure. However, ADMIXTURE runs considerably faster, solving problems in minutes that take structure hours. In many of our experiments, we have found that ADMIXTURE is almost as fast as EIGENSTRAT. The runtime improvements of ADMIXTURE rely on a fast block relaxation scheme using sequential quadratic programming for block updates, coupled with a novel quasi-Newton acceleration of convergence. Our algorithm also runs faster and with greater accuracy than the implementation of an Expectation-Maximization (EM) algorithm incorporated in the program FRAPPE. Our simulations show that ADMIXTURE's maximum likelihood estimates of the underlying admixture coefficients and ancestral allele frequencies are as accurate as structure's Bayesian estimates. On real-world data sets, ADMIXTURE's estimates are directly comparable to those from structure and EIGENSTRAT. Taken together, our results show that ADMIXTURE's computational speed opens up the possibility of using a much larger set of markers in model-based ancestry estimation and that its estimates are suitable for use in correcting for population stratification in association studies.
            Bookmark
            • Record: found
            • Abstract: found
            • Article: found
            Is Open Access

            An integrated map of genetic variation from 1,092 human genomes

            Summary Through characterising the geographic and functional spectrum of human genetic variation, the 1000 Genomes Project aims to build a resource to help understand the genetic contribution to disease. We describe the genomes of 1,092 individuals from 14 populations, constructed using a combination of low-coverage whole-genome and exome sequencing. By developing methodologies to integrate information across multiple algorithms and diverse data sources we provide a validated haplotype map of 38 million SNPs, 1.4 million indels and over 14 thousand larger deletions. We show that individuals from different populations carry different profiles of rare and common variants and that low-frequency variants show substantial geographic differentiation, which is further increased by the action of purifying selection. We show that evolutionary conservation and coding consequence are key determinants of the strength of purifying selection, that rare-variant load varies substantially across biological pathways and that each individual harbours hundreds of rare non-coding variants at conserved sites, such as transcription-factor-motif disrupting changes. This resource, which captures up to 98% of accessible SNPs at a frequency of 1% in populations of medical genetics focus, enables analysis of common and low-frequency variants in individuals from diverse, including admixed, populations.
              Bookmark
              • Record: found
              • Abstract: found
              • Article: found
              Is Open Access

              Inferring the Joint Demographic History of Multiple Populations from Multidimensional SNP Frequency Data

              Introduction Demographic models inferred from genetic data play several important roles in population genetics. First, they complement archeological evidence in understanding prehistorical events (such as the number and timing of major continental migrations) which have left no written record [1],[2]. Second, they facilitate the search for genetic regions that have been targets of non-neutral forces, such as recent natural selection, by guiding our expectations as to how much sequence and haplotype variation one expects to see in a given genomic region (and, more importantly, the variance around these expectations) [3]. Finally, existing demographic models can guide sampling design for subsequent population or medical genetic studies. Given their many uses, it is not surprising that many studies have inferred demographic models for populations of humans and other species [4]–[15]. The process of inferring a demographic model consistent with a particular data set typically involves exploring a large parameter space by simulating the model many times, often using coalescent-theory based Monte Carlo approaches. For computational reasons, many of the demographic inference procedures developed thus far have focused on single population models or models with multiple populations but no subsequent migration after subpopulations split (i.e., [4]–[6],[16],[17], but also see [10],[18]). Methods that do consider multiple populations with migration often assume independent non-recombining regions [7],[19] and do not often scale to genomic size data sets. Approaches for jointly considering recombination and migration often use a restricted set of summary statistics [9] of the data, which limits their statistical power. Finally, complex demographic inferences that make use of many summary statistics are often very computationally intensive [8],[10],[18], which precludes thorough investigation of their statistical properties. Here, we develop and apply a computationally efficient diffusion-based approach to the problem of demographic inference, based on the multi-population allele frequency spectrum (AFS) (i.e., the joint distribution of allele frequencies across diallelic variants) [10],[17],[18],[20],[21]. Given a genetic region sequenced in multiple individuals from each of populations, the resulting AFS is a P-dimensional matrix. Each entry of this matrix records the number of diallelic genetic polymorphisms in which the derived allele was found in the corresponding number of samples from each population. For example, if diploid individuals from two populations were sequenced, with 10 individuals from population 1 and 5 from population 2, the AFS would be a 21-by-11 matrix (indexed from 0). The [2,0] entry would record the number of polymorphisms for which the derived allele was seen twice in population 1 but never seen in population 2, while the [20],[5] entry would record polymorphisms for which the derived allele was homozygous in all individuals from population 1 and seen 5 times in population 2. If all polymorphic sites possess only two alleles and can be considered independent, the AFS is a complete summary of the data. Many of the statistics commonly used for population genetic inference, such as and Tajima's , are summaries of the AFS (see [18],[22]). Efficient techniques exist for simulating the AFS of a single population [4],[5],[23]. The joint AFS between two populations has been used by several recent studies [10],[11],[18],[24], but these have all relied upon very computationally intensive coalescent simulations. Here we approximate the joint multi-population AFS by numerical solution of a diffusion equation, and our implementation supports up to three simultaneous populations. Because the diffusion approach neglects linkage, our comparison with the data is through a composite likelihood function. Such likelihoods are consistent estimators under a wide range of population genetic scenarios for selectively-neutral data, but do not correctly capture variances [25]. (Lower recombination induces higher linkage and higher variance in the entries of the AFS.) As we demonstrate below, the efficiency of our diffusion approach enables both conventional and parametric bootstrap resampling of the data, allowing us to accurately estimate confidence intervals for parameter values and critical values for hypothesis tests [26], accounting for any degree of linkage found in the data. This bootstrap procedure overcomes the traditional concerns with composite likelihood as a philosophy for inference in population genetics. To demonstrate the utility of our approach, we apply our method to two epochs in human history, using single nucleotide polymorphism (SNP) data from the Environmental Genome Project (EGP) [27], the largest public database of human resequencing data. We first study the expansion of humans out of Africa, jointly modeling the history of African, European, and East Asian populations. We then study the settlement of the New World, jointly modeling European, East Asian, and admixed Mexican populations. In both cases, we quantify the uncertainty of our parameter inferences and test hypotheses about migration (bootstrapping to account for linkage). In particular, we infer an earlier divergence between African and Eurasian populations than previous studies, because our inferences account for the substantial migration between these populations. Our methods also find no evidence for multiple migrations between East Asia and the New World. While similarly complex models for human continental populations have been studied [8], to our knowledge, our analysis is the first in which the full joint AFS is used for inference and in which uncertainty and goodness-of-fit have been quantified. An important advantage of the diffusion approach is the ease with which selection can be incorporated. As an illustrative application, we also predict the distribution of protein-coding variation between populations. In agreement with the data, we find that less nonsynonymous variation is shared between populations than might be expected based only on patterns of shared noncoding variation. While no model can capture the full complexity of any species' genetic history, the models presented refine our understanding of the expansion of humanity across the globe. None of the methodology is specific to humans, and we expect our method will find wide application to demographic inference of other species. Methods Diffusion approximation To efficiently simulate the AFS, we adopt a diffusion approach. Such approaches have a long and distinguished history in population genetics, dating back to R. A. Fischer [28]–[30]. The diffusion approach is a continuous approximation to the population genetics of a discrete number of individuals evolving in discrete generations. An important underlying assumption is that per-generation changes in allele frequency are small. Consequently, the diffusion approximation applies when the effective population size is large and migration rates and selection coefficients are of order . If we have samples from populations, the numbers of sampled sequences from each population are . (For diploids, is typically twice the number of individuals sampled from population 1.) Entry of the AFS records the number of diallelic polymorphic sites at which the derived allele was found in samples from population 1, from population 2, and so forth. (If ancestral alleles cannot be determined, then the “folded” AFS can be considered, in which entries correspond to the frequency of the minor allele.) We model the evolution of , the density of derived mutations at relative frequencies in populations at time . (All run from 0 to 1.) Given an infinitely-many-sites mutational model [31] and Wright-Fisher reproduction in each generation, the dynamics of for an arbitrary finite number of populations are governed by a linear diffusion equation: (1) The first term models genetic drift, and the second term models selection and migration. Figure 1A illustrates the effects of different evolutionary forces on components of . Time is in units of , where is the time in generations and is a reference effective population size. The relative effective size of population is . The scaled migration rate is , where is the proportion of chromosomes per generation in population that are new migrants from population . (Thus migration is assumed to be conservative [32]). Finally, the scaled selection coefficient is , where is the relative selective advantage or disadvantage of variants in population . Boundary conditions are no-flux except at two corners of the domain, where all population frequencies are 0 or 1; these are absorbing points corresponding to allele loss or fixation. Because the diffusion equation is linear, we can solve simultaneously for the evolution of all polymorphism by continually injecting density at low frequency in each population (at a rate proportional to the total mutation flux ), corresponding to novel mutations. 10.1371/journal.pgen.1000695.g001 Figure 1 Frequency spectrum gallery. (A) Qualitative effects of modeled neutral genetic forces on , the density of alleles at relative frequencies and in populations 1 and 2. (B) For the spectra shown, an equilibrium population of effective size diverges into two populations generations ago. Populations 1 and 2 have effective sizes and , respectively. Migration is symmetric at per generation, and . (C) The AFS at . Each entry is colored by the logarithm of the number of sites in it, according to the scale shown. (D) The AFS at various times for various demographic parameters, on the same scale as (B). (E) Comparison between coalescent- and diffusion-based estimates of the likelihood of data generated under the model (A). Coalescent-based estimates of the likelihood, each of which took approximately 7.0 seconds, are represented in the histogram. The result from our diffusion approach, which took 2.0 seconds, is represented by the red line. For accuracy comparison, the yellow line indicates the likelihood inferred from coalescent simulations. Changes in population size and migration alter the parameters in Equation 1, while population splits and mergers alter the dimensionality of . For example, if new population 3 is admixed with a proportion from population 1 and from population 2 then (2) where denotes the Dirac delta function. To remove population 2, is integrated over : . Given , the expected value of each entry of the AFS, , is found via a P-dimensional integral over all possible population allele frequencies of the probability of sampling derived alleles times the density of sites with those population allele frequencies. For SNP data obtained by resequencing, these probabilities are binomial, so (3) In some cases of ascertained data [33], the resulting bias can be corrected by modifying the above equation [11],[34]. Likelihood-based inference Let correspond to the parameters of a demographic model we wish to estimate from the observed multi-population allele frequency spectrum, which we denote . Assuming no linkage between polymorphisms, each entry in the AFS is an independent Poisson variable [20], with mean (which depends on ). We can, therefore, construct a likelihood function using standard statistical theory: (4) So is the product of Poisson likelihoods, one for each entry in the AFS. In words, our approach consists of calculating the expected allele frequency spectrum using a particular demographic model (and set of parameter values for that demographic model) using our diffusion approach. We then maximize the similarity between and the observed AFS over the parameter values that can take on. Competing demographic models can be chosen from using standard statistical theory such as the nested likelihood ratio test or information criteria such as the Akaike or Bayesian Information Criteria. For linked polymorphisms, is a composite likelihood. Such likelihoods are consistent estimators under a wide range of neutral population genetic scenarios [25], but simulations incorporating linkage are necessary to estimate variances and define critical values for hypothesis testing and model selection. In our applications, we estimate variances using simulations from the coalescent simulator ms [35]. Numerics Solving the multi-population diffusion equation is substantially more demanding than the single-population case [23]. This is primarily because the boundary conditions are more complex, and the numerical grid of population frequencies must be much coarser to be computationally tractable, because it is of dimensions. For example, a previous single-population study [23] used a uniform grid of order values between 0 and 1. Extending this grid to a three-population simulation would require an infeasible array of size . Instead, we use a nonuniform grid and extrapolation to enable accurate computation using of order 100 values along each dimension, for a final array size of order . We solve the diffusion equation on a regular nonuniform grid, using a finite difference scheme [36] inspired by the method of Chang and Cooper [37] (Text S1). Mutations in population arise at frequency . The diffusion approximation applies when , but the minimum frequency in our numerical simulation is that of the first grid point, denoted . To overcome this, we extrapolate our results to an infinitely fine grid. We use a quadratic extrapolation on the logarithm of the AFS entry, modeling the bias introduced by the finite initial grid point as (5) Here is an AFS element calculated at grid size and is the extrapolated value. Given three evaluations at different grid sizes , we solve for and use this value when calculating likelihoods. This vastly increases both the speed and accuracy of our calculation (Supplementary Figure 3 in Text S1). While higher-order extrapolations may improve accuracy in some cases, they may also be more sensitive to numerical noise. Our empirical experience is that a quadratic approximation provides a good compromise between accuracy, efficiency, and robustness. The computational cost for a single likelihood evaluation scales as where is the number of grid points used. In our experience, for stability and accuracy should somewhat larger than the largest population sample size. Although our theoretical framework extends to an arbitrary number of populations, the exponential scaling of computation with limits our current applications to three simultaneous populations. Importantly, our likelihood calculation is deterministic and numerically smooth, so numerical derivatives can be used in optimization. We use the the quasi-Newton BFGS method [36], which converges in order steps, where is the number of free parameters. Our implementation of these methods, , is written in cross-platform Python and C, making use of the NumPy [38], Scipy [39], and Matplotlib libraries [40]. It is distributed under the open-source BSD license. All calculations herein were performed with version 1.1.0. We estimated parameter uncertainties by both conventional bootstrap (fitting data sets resampled over loci) and parametric bootstrap (fitting simulated data sets). To generate simulated data we used the coalescent program ms [35], a region-specific recombination rate, and the detailed EGP sequencing strategy (Text S1). The confidence intervals reported in Table 1 and Table 2 derive from a normal approximation to the bootstrap results. For the conventional bootstrap, confidence intervals were calculated as . For the parametric bootstrap, biased-corrected intervals were calculated as . The maximum-likelihood value is denoted , while and denote the mean and standard deviation of the bootstrap results. Aside from the growth rates , all our model parameters are positive by definition, so in those cases we used their logarithms when calculating confidence intervals. 10.1371/journal.pgen.1000695.t001 Table 1 Out of Africa inferred parameters. parametera Maximum likelihood conventional bootstrap 95% confidence interval parametric bootstrap bias-corrected 95% confidence interval 7,300 4,400–10,100 6,300–9,200 12,300 11,500–13,900 11,100–13,100 2,100 1,400–2,900 1,700–2,600 1,000 500–1,900 500–1,500 (%) 0.40 0.15–0.66 0.26–0.57 510 310–910 320–750 (%) 0.55 0.23–0.88 0.32–0.79 ( ) 25 15–34 19–36 ( ) 3.0 2.0–6.0 1.6–7.6 ( ) 1.9 0.3–10.4 0.7–6.9b ( ) 9.6 2.3–17.4b 5.7–20.2 (kya) 220 100–510 90–410 (kya) 140 40–270 60–310 (kya) 21.2 17.2–26.5 17.6–23.9 aSee Figure 2B for model schematic. Growth rates and migration rates are per generation. bOne low-migration outlier was removed for each of these estimations. 10.1371/journal.pgen.1000695.t002 Table 2 Settlement of New World inferred parameters. parametera maximum likelihood conventional bootstrap 95% confidence interval parametric bootstrap bias-corrected 95% confidence interval 1,500 700–2,100 900–2,200 (%) 0.23 0.08–0.45 0.16–0.34 590 320–800 410–790 (%) 0.37 0.16–0.60 0.24–0.51 800 160–1,800 140–1,600 (%) 0.50 0.14–1.17 0.41–0.98 ( ) 13.5 7.5–32.2 9.9–20.8 (kya) 26.4 18.1–43.1 21.7–30.7 (kya) 21.6 16.3–26.9 18.6–24.7 (%) 48 42–60 41–55 aSee Figure 3B for model schematic. Growth rates and migration rates are per generation. is the average European admixture proportion of the Mexican-Americans sampled. Pearson's goodness-of-fit test was performed using all 213−2 = 9259 bins in the AFS. Results are similar if we restrict our analysis to entries in which the expected value is greater than 1 or greater than 5. Data We used the National Institute of Environmental Health Science's Environmental Genome Project SNPs database [41], which results from direct Sanger resequencing of environmental response genes in several populations. We considered all diallelic SNPs in 5.01 Mb of sequence from noncoding regions of 219 autosomal genes (Supplementary Table 8 in Text S1). These data have been the subject of many publications, including [17],[23],[27],[42]. As an assessment of quality, additional high-coverage short-read sequencing has recently been performed across 8 samples in this data set. Over 26,000 sites, the SNP concordance between this next-generation sequencing and the original Sanger sequencing averages 99.5% (D. Nickerson, personal communication). Given the high quality of this data set, we do not incorporate sequencing error into our modeling. We believe such correction will be essential in future applications to less accurate short-read sequencing data, as inference based on the frequency spectrum is sensitive to rare alleles. To estimate the ancestral allele, we aligned to the panTro2 build of the chimp genome [43]. Like other methods based on the unfolded AFS, our analysis is sensitive to errors in identifying the ancestral allele. We statistically corrected the AFS for ancestral misidentification [17], using a context-dependent substitution model [44]. This procedure has been shown to perform better than aligning to multiple species [17]. To account for missing data and ease qualitative comparisons between populations, we projected all spectra down to 20 samples per population [5] (Text S1). The human-chimp divergence in the data is 1.13%. We assumed a divergence time of 6 My [45] and a generation time of 25 years. This yielded an estimated neutral mutation rate of per site per generation, which is comparable to direct estimates [46]. There is some controversy as to the appropriate generation time to assume in human population genetic studies [47],[48]. In particular, the human generation time may differ between cultures and may have changed during our biological and cultural evolution. The bootstrap uncertainties reported in Table 1 and Table 2 do not include systematic uncertainties in the human-chimp divergence or generation times. The generation time, however, formally cancels when converting between genetic and chronological times. Nonsynonymous polymorphism In our prediction of the distribution of nonsynonymous polymorphism, the distribution of selective effects assumed was a negative-gamma distribution with shape parameter and scale [49]. The AFS was calculated by trapezoid-rule integration over this distribution, using 201 evaluations logarithmically spaced over . All demographic parameters, including the scaled mutation rate , were set to the maximum-likelihood values from our Out of Africa analysis. Results First, we explored how various demographic forces affect the AFS, building intuition for our subsequent applications to real data. We then compared the performance of diffusion versus coalescent methods for evaluating the AFS, finding that the diffusion approach is substantially faster. We then applied our diffusion approach to infer parameters for plausible demographic models for the history of continental human populations. We first considered the expansion of humans out of Africa and then the settlement of the New World. In these applications, we inferred the maximum composite-likelihood parameters of our models using diffusion fits to the real data. To account for linkage in estimating variances and critical values for hypothesis tests, we then repeatedly fit both conventional and parametric bootstrap data sets. Finally, in an application incorporating selection, we predicted the distribution of nonsynonymous variation between populations in our Out of Africa model, finding good agreement with the available data. Demographic effects on the AFS In Figure 1, we provide examples of the AFS under different demographic scenarios. Figure 1B illustrates the isolation-with-migration model for which the spectra are calculated. The expected spectrum at zero divergence time is shown in Figure 1C. Figure 1D shows the expected spectrum at various divergence times under various demographic scenarios. Qualitatively, correlation between population allele frequencies declines with increasing divergence time, depopulating the diagonal of the AFS. On the other hand, migration prolongs and sustains correlation. Less obviously, AFS entries corresponding to shared low-frequency alleles distinguish between increased migration and reduced divergence time (Supplementary Figure 1 in Text S1). Additionally, differences in genetic drift between populations with different effective sizes result in asymmetries in the AFS. These qualitative features of the AFS are also evident in human data. Detailed modeling allows us to quantify our inference regarding the type, timing, and strength of demographic events that are consistent with the data. Computational performance The computer program implementing our method is named (Diffusion Approximations for Demographic Inference). It is open-source and freely available at http://dadi.googlecode.com. Figure 1E compares with a coalescent approach to evaluating the likelihood of frequency spectrum data. The coalescent simulator ms [35] was used to generate a simulated data set from the model in Figure 1B, with parameters , , , , , scaled total recombination rate , and 20 samples per population. Coalescent-based estimates of the expected AFS were generated by averaging ms simulations, each run with and . These estimates were scaled to for comparison with the simulated data set. (This procedure is substantially faster than simulating with larger and .) Each estimate took approximately 7.2 seconds of computation. The histogram in Figure 1E shows the resulting distribution of estimated likelihoods of the data. Shown by the red line in Figure 1E is the result from our diffusion approach (with grid sizes ), which took approximately 2.0 seconds of computation. The yellow line is the likelihood from coalescent simulations, illustrating the high accuracy of our diffusion approach. (Note that the coalescent approach we consider here is not necessarily optimal. We are, however, unaware of any such approach that is competitive in computational speed with the diffusion method.) The computational advantage of the diffusion method is even larger when placed in the context of parameter optimization. Unlike the coalescent approach, there is no simulation variance, so efficient derivative-based optimization methods can be used. As examples, consider our applications to human data, which involve 20 samples per population. On a modern workstation, fitting a single-population three-parameter model took roughly a minute, while fitting a two-population six-parameter model took roughly 10 minutes. The fits of three-population models with roughly a dozen parameters typically took a few hours to converge from a reasonable initial parameter set. This speed allows us to use extensive bootstrapping to estimate variances, overcoming the limitations of composite likelihood. Expansion out of Africa Our analysis of human expansion out of Africa used data from three HapMap populations: 12 Yoruba individuals from Ibadan, Nigeria (YRI); 22 CEPH Utah residents with ancestry from northern and western Europe (CEU); and 12 Han Chinese individuals sampled in Beijing, China (CHB). Because approaches based on the frequency spectrum are sensitive to miscalling of the ancestral state, we statistically corrected for ancestral misidentification using an approach that accounts for a myriad of mutation and context-dependent biases (such as CpG effects) [17]. To ease qualitative comparison among populations and account for missing data, we projected the data down to 20 sampled chromosomes per population [5]. Because this data set is of very high quality (>99% concordance of sequenced SNPs with next-generation sequencing of the same individuals to high coverage; see Methods), we do not explicitly correct for sequencing errors here. We were left with 17,446 segregating diallelic SNPs from effectively 4.04 Mb of sequence. Figure 2A shows the resulting AFS. For ease of visualization, the top row of Figure 2C shows the two-population marginal spectra. 10.1371/journal.pgen.1000695.g002 Figure 2 Out of Africa analysis. (A) AFS for the YRI, CEU, and CHB populations. The color scale is as in (C). (B) Illustration of the model we fit, with the 14 free parameters labeled. (C) Marginal spectra for each pair of populations. The top row is the data, and the second is the maximum-likelihood model. The third row shows the Anscombe residuals [61] between model and data. Red or blue residuals indicate that the model predicts too many or too few alleles in a given cell, respectively. (D) The observed decay of linkage disequilibrium (black lines) is qualitatively well-matched by our simulated data sets (colored lines). (E) Goodness-of-fit tests based on the likelihood and Pearson's statistic both indicate that our model is a reasonable, though incomplete description of the data. In both plots, the red line results from fitting the real data and the histogram from fits to simulated data. Poorer fits lie to the right (lower and higher ). (F) The improvement in likelihood from including contemporary migration in the real data fit (red line) is much greater than expected from fits to simulated data generated without contemporary migration (histogram). This indicates that the data contain a strong signal of contemporary migration. There are many possible three-population demographic models one could consider for these populations. To develop a parsimonious yet realistic model, we first considered the marginal AFS for each population and each pair of populations. Previous analyses found that the YRI spectrum is well-fit by a two-epoch model with ancient population growth [5],[17], and we found this as well (Supplementary Figure 6 in Text S1). Previous analyses of the CEU and CHB populations found that both populations went through bottlenecks [5],[11] concurrent with divergence [11]. Such models qualitatively fit our marginal CEU-CHB spectrum (Supplementary Figure 7 in Text S1). Combining these demographic features yields the model illustrated in Figure 2B. The maximum likelihood values for the 14 free parameters are reported in Table 1. Qualitatively, the resulting model reproduces the observed spectra well, as seen in the second and third rows of Figure 2C. (The correlation between adjacent residuals is due in part to our projection of the data down from a larger sample size (Supplementary Figure 8 in Text S1).) Allowing for asymmetric gene flow yielded very little improvement in fit, as did allowing for growth in the Eurasian ancestral population or allowing the CEU and CHB bottleneck and divergence times to differ (data not shown). Our composite likelihood function assumes that polymorphic sites are independent. Because it thus overestimates the number of effective independent data points, confidence intervals calculated directly from the composite likelihood function will be too small. To control for linkage, we performed both conventional and parametric bootstraps. Because our sequenced genes are typically well separated, they can be treated as independent, and our conventional bootstrap resampled from the 219 sequenced loci. For the parametric bootstrap, simulated data sets that incorporate linkage and the EGP's sequencing strategy were generated with ms [35]. Table 1 reports parameter 95% confidence intervals from both the conventional and bias-corrected parametric bootstraps. The parametric bootstraps yield slightly smaller confidence intervals than the conventional bootstrap, suggesting that some variability in the data has not been accounted for by our simulations. This variability may involve small varied selective forces on the sequenced regions or slight relatedness between sampled individuals. The parametric bootstrap results additionally show that our method possesses very little bias in parameter inference (Supplementary Figure 9 in Text S1). As seen in Table 1, the times for growth in the African ancestral population and divergence of the Eurasian ancestral population ( and ) have particularly wide confidence intervals, likely a consequence of the high inferred migration rate between the African and Eurasian ancestral populations. shows high correlation with the ancestral population size , while shows no strong linear correlation with any other single parameter (Supplementary Figure 11 in Text S1). We found that 92 out of our 100 conventional bootstrap fits yield , supporting the contention that the CHB population suffered a more severe bottleneck than the CEU population [11] (Supplementary Figure 11 in Text S1). We used several metrics to assess our model's goodness-of-fit, in additional to visual inspection of the residuals seen in Figure 2C. Figure 2D compares the decay of linkage disequilibrium (LD) in the data and in the parametric bootstrap simulations. The agreement seen is notable because our demographic inference used no LD information in building and fitting the model. This LD comparison thus serves as independent validation of both our model and bootstrap simulations. We also asked whether the likelihood found in the real data fit is atypical of fits to simulated data. Out of fits to 100 simulated data sets, 2 produced a smaller likelihood (worse fit) than the real data fit (Figure 2E), yielding a p-value of ≈0.02. One can craft examples in which a likelihood-based goodness-of-fit test fails to exclude very poor models [50]. Thus we also applied Pearson's goodness-of-fit test, a more robust and standard method for data that is in Poisson-distributed bins, such as the AFS [36]. In our case, we must use our parametric bootstraps to assess the significance of the sum-of-squared-residuals test statistic , because many entries in the AFS are small and because they are not strictly independent. Figure 2E shows the bootstrap-derived empirical distribution of . Two of the bootstraps yielded a larger (worse fit) than the real data fit, giving a p-value of ≈0.02, identical to that from the likelihood-based test. (The two simulations that yield a higher than the real fit are not the same two that yield a lower , suggesting that these tests are somewhat independent.) In some cases specific frequency classes of SNPs, such as rare alleles, may be of particular interest. In Supplementary Table 5 in Text S1, we provide comparisons of the joint distribution of rare alleles seen in the data with that from our simulations. These comparisons indicate that our model also reproduces well this interesting region of the frequency spectrum. Finally, in Figure 4 we compare the model and data using larger bins of SNPs specific to particular populations or segregating at high or low frequency. In all cases the model agrees within the uncertainty of the bootstrapped data. Taken together, these tests suggest that our model provides a reasonable, though not complete, explanation of the data, lending credence to our demographic estimates. The inferred contemporary migration parameters ( , and ) are small, raising the question as to whether they are statistically distinguishable from zero. Figure 2F shows that the improvement in fit to the real data upon adding contemporary migration to the model is much larger than would be expected if there were no such migration, implying that the contemporary migration we infer is highly statistically significant. Omitting ancient migration ( ) reduced fit quality even more, indicating that the data also demand substantial ancient migration (data not shown). Settling the New World To study the settlement of the Americas, we used the previously considered 22 CEU and 12 CHB individuals, plus an additional 22 individuals of Mexican descent sampled in Los Angeles (MXL). Data were processed as in our Out of Africa analysis, yielding 13,290 segregating SNPs from effectively 4.22 Mb of sequence. Figure 3A shows the resulting AFS, while Figure 3C shows the marginal spectra. 10.1371/journal.pgen.1000695.g003 Figure 3 Settlement of the New World analysis. As in Figure 2, (A) is the data, (B) is a schematic of the model we fit, (C) compares the data and model AFS, and (D) compares LD. (E) The fit of our model to the real data is not atypical of fits to simulated data. (F) The improvement in real data fit upon including CHB-MXL migration (red line) is very typical of the improvement in fits to simulated data without CHB-MXL migration. Thus we have no evidence for CHB-MXL migration after divergence. A model in which the CEU and CHB diverge from an equilibrium population did not reproduce the AFS well (Supplementary Figure 13 in Text S1). Interestingly, a model allowing a prior size change in the ancestral population better fit the AFS but very poorly fit the observed LD decay (Supplementary Figure 13 in Text S1). Thus, reproducing the AFS does not guarantee reproduction of LD, at least given a historically unrealistic model. To develop a more realistic model, we endeavored to include the effects of Eurasian divergence from and migration with the African population. Computational limits precluded us from considering all 4 populations simultaneously, so we dropped the African population from the simulation upon MXL divergence (Figure 3B). Table 2 records the maximum-likelihood parameter values inferred for this model. Because this fit did not include African data, we could not reliably infer demographic parameters involving the African population. Thus, for this point estimate we fixed the Africa-related parameters , , , , , , and to their maximum-likelihood values from Table 1. Figure 3C compares the model and data spectra. The residuals show little correlation, with the possible exception that the model may underestimate the number of high-frequency segregating alleles. Parameter confidence intervals are reported in Table 2. To account for our uncertainty in those parameters derived from the Out of Africa fit, for each conventional bootstrap fit we used a set of Africa-related parameters randomly chosen from the sets yielded by our Out of Africa conventional bootstrap. For the parametric bootstrap, we used the maximum-likelihood point estimates. Again, we see that the conventional bootstrap confidence intervals are comparable to, although slightly wider than, the parametric bootstrap intervals. Several parameters in this analysis have direct correspondence with our Out of Africa analysis. Of particular note, the confidence intervals for the CEU-CHB divergence time overlap. In assessing goodness of fit, Figure 3D shows that this model does indeed reproduce the observed pattern of LD decay. Unlike in our Out of Africa analysis, however, here the LD decay was used to choose the form of the model (although not its parameter values), so this is not a completely independent assessment of fit. Of our 100 parametric bootstrap fits, 13 yielded a worse likelihood than the real fit (Figure 3E), for a p-value of ≈0.13. Applying Pearson's test, we find that 23 of 100 bootstrap fits yield a higher (worse) than the fit to the real data, for a p-value of ≈0.23, similar to that of the likelihood analysis. Comparing distributions of rare alleles, our model typically reproduces the observed distribution well, although it may be somewhat overestimating the proportion of alleles that are rare or absent in the CHB population (Supplementary Table 7 in Text S1). In sum, our model appears to be a reasonable explanation of this data, somewhat better than in our Out of Africa analysis. An essential feature of the Mexican-American individuals considered here is that they are typically admixed from Native American and European ancestors. The ≈50% average European admixture proportion we inferred for the MXL population is consistent with previous estimates for Los Angeles Latinos [51]. We have no direct data from the Native American populations ancestral to MXL, but our model does account for their divergence from East Asia. A model neglecting this divergence (by setting to zero) fit the data substantially worse and yields an unrealistically high average European admixture proportion into MXL of 68%. Not only are Mexican-American individuals admixed, their admixture proportions also vary, and this subtlety is not directly accounted for in our analysis. To assess its effect on our results, we first roughly estimated the ancestry proportion of each individual, using essentially a maximum-likelihood version [18] of the algorithm used in structure [52] (Text S1). (Methods based on “admixture LD”, which identify breakpoints between regions of Native American and European ancestry, may be more powerful [53]. However, the strategy used by the EGP of sequencing widely spaced genes will resolve few of these breakpoints, limiting the applicability of these methods.) We then performed additional parametric bootstrap analyses, using simulations with a distribution of individual ancestry chosen to mimic that seen in the data and, to further test the method, with an extremely wide distribution. These simulations showed that variation in individual ancestry does not bias our parameter inferences (Supplementary Figure 19 in Text S1). Remarkably, it does not even change our statistical power. This is evidenced by the fact that these bootstrap simulations yielded confidence intervals identical to our original simulations without variation in ancestry proportion (Supplementary Figure 19 in Text S1). Nevertheless, future studies may profit by incorporating individual ancestry information [18], perhaps inferred from admixture LD. Finally, our model allowed us to assess the role recurrent migration from Asia played in the settlement of the New World [2]. When we added CHB-MXL migration to our model, we found that the maximum likelihood migration rate was per generation. As shown in Figure 3F, the resulting improvement in likelihood is typical (p-value≈0.45) of fits including CHB-MXL migration to data simulated without it. Our data and analysis thus yielded no evidence of recurrent migration in the settlement of the New World. Note, however, that this simple test does not necessarily rule out more complex scenarios, in which migration may vary over time. Nonsynomymous polymorphism Polymorphisms that change protein amino acid sequence are of medical interest because they are particularly likely to affect gene function [54]. Correspondingly, they are often subject to natural selection. Diffusion approaches are particularly useful for studying such nonsynonymous polymorphism, because they easily incorporate selection. Although the diffusion approximation assumes that sites are unlinked, nonsynonymous segregating sites are rare enough that this is often a reasonable approximation [49]. As an illustration, we used our Out of Africa demographic model to predict the distribution of such variation between continental populations. To do so, we must specify a distribution for the selective effects of nonsynonymous mutations that enter the population. For this we adopted a negative gamma distribution whose parameters were recently inferred [49]. The resulting distribution of segregating variation is shown in Figure 4A. (To ease comparison, we have assumed the same scaled mutation rate as in the neutral case of Figure 2C.) As expected, selection sharply reduces the amount of segregating polymorphism. Figure 4B shows the proportion of variants within various classes. Also as expected, selection shifts nonsynonymous variation toward lower frequencies, raising the proportion of singletons and lowering the proportion at frequency greater than 10%. Less obviously, it also reduces the proportion of variation that is shared between populations. In the neutral case, 43% of polymorphism is predicted to be present in more than one population, while in the selected case only 35% is. Thus genetic inferences from coding polymorphism may be less transferable between populations than might be expected from neutral patterns of allele sharing. 10.1371/journal.pgen.1000695.g004 Figure 4 Distribution of nonsynonymous polymorphism. We simulated our maximum-likelihood Out of Africa demographic model with a distribution of selective effects previously inferred for nonsynonymous polymorphism [49]. (A) To enable direct comparison with the neutral AFS (Figure 2C), the scaled mutation rate was set identically, as is the color scale. As expected, selection dramatically reduces the amount of segregating polymorphism. (B) Shown are the proportions of variation found in various frequency classes. As expected, nonsynonymous variants typically have lower frequency. They also less likely to be shared between populations. Data error bars indicate 95% bootstrap confidence intervals. In the data considered here, there are about 400 nonsynomymous polymorphisms segregating in the three populations considered. This is too few for a detailed goodness-of-fit test of our predicted distribution. (Although see Supplementary Figure 20 in Text S1 for a direct AFS comparison.) Nevertheless, we observe that our predictions shown in Figure 4B all lie within the bootstrap 95% confidence intervals from the data. Discussion Our diffusion approximation to the joint allele frequency spectrum is a powerful tool for population genetic inference. Although the diffusion approximation neglects linkage between sites, our method's computational efficiency allows us to use extensive bootstrap simulations to account for the effects of linkage. (Let us reiterate that linkage does not affect the expected allele frequency spectrum of neutral sites, so our diffusion-based approach is estimating the same AFS that coalescent simulations are estimating, but in a small fraction of the time). We applied our method to human expansion out of Africa and settlement of the New World, using public resequencing data from the Environment Genome Project. The flexibility of the diffusion approach also allowed us to consider the distribution of non-neutral variation, which is difficult to address with other approaches. Although no model can capture in detail the complete history of any population, the models presented here help refine our understanding of human expansion across the globe. Our demographic results are in most respects broadly consistent with previous analyses of human populations. In particular, single-population analyses have also inferred African population growth and European and Asian bottlenecks [4]–[6]. Also, the migration rates we infer are similar to those inferred by Schaffner et al. [8] but somewhat smaller than those of Cox et al. [15]. On the other hand, Keinan et al. [11] inferred no significant migration between CEU and CHB. Finally, our estimate of a New World founding effective population size in the hundreds is compatible other inferences [14]. Perhaps our most interesting demographic results are the inferred divergence times. Other studies [11],[12] have estimated divergence times between Europeans and East Asians similar to the ≈23 kya we infer. Interestingly, archeological evidence places humans in Europe much earlier (≈40 kya) [1]. Our inferred divergence time of ≈22 kya between East Asians and Mexican-Americans is somewhat older than the oldest well-accepted New World archeological evidence [2]. The divergence we infer may reflect the settlement of Beringia, rather than the expansion into the New World proper [14]. Finally, the divergence time of ≈140 kya we infer between African and Eurasian populations is consistent with archeological evidence for modern humans in the Middle East ≈100 kya [1], but it is much older than other inferences of ≈50 kya divergence from mitochondrial DNA [1]. This discrepancy may be explained by our inclusion of migration in the model. Migration preserves correlation between population allele frequencies, so an observed correlation across the genome can be explained by either recent divergence without migration or ancient divergence with migration. In fact, the African-Eurasian migration rate we infer of ≈25×10−5 per generation is comparable to the ≈100×10−5 inferred from census records between modern continental Europe and Britain [55]. One difficulty in interpreting our divergence times is that the sampled populations may not best represent those in which historically important divergences occurred. For example, the Yoruba are a West African population, so the divergence time we infer between Yoruba and Eurasian ancestral populations may correspond to divergence within Africa itself. Future studies of more populations [56]–[58] will help alleviate this difficulty. Another difficulty is that the genic loci we study here may not be ideal for demographic inference. Although we consider only noncoding sequence in fitting our historical model, selection on regulatory or linked coding sites may skew the AFS [59]. In fact, the EGP data have been shown to differ in some ways (e.g. Tajima's ) from intergenic regions [58]. Nevertheless, we use the EGP data because it is currently the largest public resource of noncoding human genetic variation, and we fit a neutral model because disentangling the small expected effects of selection on these sites from demographic effects will require additional data. The rapidly declining cost of sequencing will give future studies access to many more loci that are likely to be less influenced by selection. Importantly, the computational burden of our method is independent of the amount of sequence used to construct the AFS. Additional loci will also increase power to discriminate between models and incorporate more detail. The AFS encodes substantial demographic information. It is has been shown, however, that an isolated population's AFS does not uniquely and unambiguously identify its demographic history [60]; we expect a similar result to hold for multiple interacting populations. Moreover, the AFS does not capture all the information in the data. As illustrated by the alternative New World models we considered, patterns of linkage disequilibrium encode additional information. Future studies may profit from coupling our efficient AFS simulation with methods that address other aspects of the data. We have developed a powerful diffusion-based method for demographic inference from the joint allele frequency spectrum. We applied our method to human expansion out of African and the settlement of the New World, developing models of human history that refine our knowledge and raise intriguing questions. We also applied our method to predict the distribution of nonsynonymous variation across populations, and this prediction is consistent with the available data. Our methods and the models inferred from it offer a foundation for studying the history and evolution of both our own species and others. Supporting Information Text S1 Complete supplementary data. (1.80 MB PDF) Click here for additional data file.
                Bookmark

                Author and article information

                Contributors
                Role: Editor
                Journal
                PLoS Genet
                PLoS Genet
                plos
                plosgen
                PLoS Genetics
                Public Library of Science (San Francisco, USA )
                1553-7390
                1553-7404
                December 2013
                December 2013
                26 December 2013
                : 9
                : 12
                : e1004023
                Affiliations
                [1 ]Department of Human Genetics, McGill University, Montréal, Québec, Canada
                [2 ]McGill University and Génome Québec Innovation Centre, Montréal, Québec, Canada
                [3 ]Department of Genetics, Stanford University, Stanford, California, United States of America
                [4 ]Ancestry.com DNA LLC, San Francisco, California, United States of America
                [5 ]Laboratorio de Genética Molecular Poblacional, Instituto Multidisciplinario de Biología Celular (IMBICE). CCT- CONICET-La Plata, Argentina and Facultad de Ciencias Naturales y Museo, Universidad Nacional de La Plata, La Plata, Argentina
                [6 ]Weill Cornell Medical College, New York, New York, United States of America
                [7 ]Department of Genetics and Genomic Sciences, The Charles Bronfman Institute for Personalized Medicine, Center for Statistical Genetics, and Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
                [8 ]Department of Bioengineering and Therapeutic Sciences and Medicine, Univeristy of California San Francisco, San Francisco, California, United States of America
                [9 ]Department of Biology, University of Puerto Rico at Mayaguez, Mayaguez, Puerto Rico
                [10 ]Department of Biochemistry, Ponce School of Medicine and Health Sciences, Ponce, Puerto Rico
                [11 ]Department of Psychiatry and Clinical Psychobiology, University of Barcelona, Barcelona, Spain
                [12 ]Universidad de Antioquia, Medellín, Colombia
                [13 ]Department of Genetics, Evolution and Environment, University College London, London, United Kingdom
                Dartmouth College, United States of America
                Author notes

                ¶ Membership of The 1000 Genomes Project can be found in the acknowledgments or at http://www.1000genomes.org/participants.

                The authors have declared that no competing interests exist.

                Conceived and designed the experiments: SG MV KS TKO ARL EGB JCMC CDB. Analyzed the data: SG FZ JKB MM AME JLRF EEK CRG WG. Contributed reagents/materials/analysis tools: SG FZ JKB AME CRG BKM JD GB TKO ARL EGB. Wrote the paper: SG MM AME CDB.

                Article
                PGENETICS-D-13-01173
                10.1371/journal.pgen.1004023
                3873240
                24385924
                2b703d56-f65b-475f-b49a-a98593aa2c64
                Copyright @ 2013

                This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                History
                : 6 May 2013
                : 29 October 2013
                Page count
                Pages: 13
                Funding
                This study was supported by NSF grant 7188155, NHGRI grant HG005715, and NIH R01 GM090087 (to CDB), UCSF Chancellors Research Fellowship, Dissertation Year Fellowship, and in part by NIH Training Grant T32 GM007175 (to CRG); 1P60 MD006902, R01 HL088133, R01 ES015794, RWJF Amos Medical Faculty Development Award, the Sandler Foundation; the American Asthma Foundation (to EGB), and the Pew Latin American Fellows Program (MM). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
                Categories
                Research Article

                Genetics
                Genetics

                Comments

                Comment on this article