23
views
0
recommends
+1 Recommend
1 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found

      Tuberculosis origin: The Neolithic scenario

      Tuberculosis
      Elsevier BV

      Read this article at

      ScienceOpenPublisher
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Related collections

          Most cited references27

          • Record: found
          • Abstract: found
          • Article: not found

          Origin, Spread and Demography of the Mycobacterium tuberculosis Complex

          Introduction The Mycobacterium tuberculosis complex (MTBC) is composed of closely related bacterial sub-species that have plagued human and animal populations for thousands of years. The most famous member of the MTBC is M. tuberculosis, the etiological agent of tuberculosis in humans that killed 1.7 million people in 2004 according to the World Health Organization [1]. A new threat is the worldwide emergence of multi-drug resistant (MDR) and extremely drug-resistant (XDR) strains. Recent data suggest that the propensity to gain drug resistance as well as the pathogen's transmissibility profile may be influenced by the genetic and evolutionary background of M. tuberculosis strains [2]. Thus, understanding the relationships and dynamics of the MTBC lineages will undoubtedly help to unravel the basis for the considerable success and spread of tuberculosis, in both humans and animals. The MTBC is essentially clonal with little evidence of horizontal gene exchange [3],[4],[5], and probably derived from a pool of ancestral tubercle bacilli, collectively called “Mycobacterium prototuberculosis” [6]. However, despite the highly successful worldwide spread of the MTBC, the evolutionary timing of this spread remains largely unknown. This lack of knowledge is largely due to the limitations of the genetic markers used so far. All efforts to time MTBC evolution with single nucleotide polymorphisms (SNPs) have been based on a non-warranted hypothesis of universal bacterial mutation rates, itself extrapolated from a very hypothetical time of divergence between Escherichia coli and Salmonella enterica [7]. In this study, we used a completely new approach by employing genetic markers based on mycobacterial interspersed repetitive units (MIRUs) to determine the timing of divergence, population diversity and spread of the MTBC. MIRU loci comprise variable numbers of tandem repeat (VNTR) sequences, which allow them to be used as powerful genotyping markers [8],[9]. In terms of genetic diversity and mutation rates, they resemble human microsatellites, which are widely used in human population genetics studies [10]. Similar to microsatellites, MIRUs behave as selectively neutral phylogenetic markers if sufficient numbers of loci are used to buffer against potential biases. Here we used experimental data on the variability and evolution of these markers in clinical isolates of infected patients, which allowed us to calculate the MIRU molecular clock and model their evolution in coalescence approaches. Based on this information and extensive analysis of a large collection of representative MTBC strains, we obtained new insights into the origin and demography of the MTBC and its dynamic association with the human host. Results M. tuberculosis phylogeny To infer the MTBC evolutionary history, we used a sample collection of 355 isolates, representative of well-identified primary branches of the MTBC world distribution (Table S1). A recently standardized combination of 24 MIRU loci (Figure S1), which does not comprise saturated loci [11], was utilized. To illustrate the power of MIRUs to reconstruct geographical patterns of genetic differentiation and their level of resolution, a distance-based tree was constructed using individual genotypes and a neighbour-joining algorithm (Figure 1A). The tree grouped all M. tuberculosis sensu stricto isolates (all from human patients) in a distinct lineage with the notable exception of the East African-Indian (EAI) population whose affiliation is unclear based on this approach. Another major lineage encompassed all MTBC strains from animals (M. microti. M. bovis, M. caprae and M. pinipedii) and the human isolates from West-Africa (M. africanum West African 1 and 2). From the resulting tree, it appears that the groupings of isolates within the primary MTBC branches based on SNPs, spoligotyping and large sequence polymorphisms (LSPs) [12],[13],[14],[15],[16],[17] (Figure S2) are highly congruent with those based on the MIRU typing, albeit the branch resolution was higher in the latter. In order to more robustly define the relationships between the lineages (by reducing the number of individuals vs the number of markers), we then grouped individual isolates into the populations defined by the above groupings and built a tree based on MIRU allelic frequencies in these populations (Figure 1B). The tree was rooted with samples of M. prototuberculosis (including M. canettii), which was recently reported to represent the progenitor of the MTBC [6]. This approach clearly revealed the distinctiveness of the two major lineages with strong bootstrap support, called hereafter clades 1 and 2. A further geographic sub-structuring within clade 1 became apparent, with distinct branches for the African (Uganda, Cameroon and S), Asian (Beijing and CAS), Latin American-Mediterranean and African-European populations (X, Ghana and Haarlem). Clade 2 is composed of both animal and human pathogenic isolates. A basal position of EAI (human tuberculosis) in clade 2 has strong statistical support, indicating a human origin for this predominantly animal-associated MTBC lineage. However, low bootstrap values within clade 2 prevent us from drawing further inferences on the branching order. 10.1371/journal.ppat.1000160.g001 Figure 1 Evolutionary relationships of the Mycobacterium tuberculosis complex. (A) Unrooted MIRU Neighbour-joining phenogram depicting genetic distance relationships among tubercle bacilli isolates based on Nei et al.'s DA distances. (B) Rooted MIRU population Neighbour-joining tree based on genetic distance. M. prototuberculosis was used as an outgroup. Values on the nodes represent the percentage of bootstrap replicates over individuals (N = 1000) showing the particular nodes. Branch lengths are proportional to the genetic distance between the tubercle lineages. It is noteworthy that low bootstrap values within clade 2 prevent us from drawing further inferences on the branching order in this clade (see also main text). Wa, West-Africa. (C) Population structure of 20 MTBC clonal lineages using the no-admixture model, where K = 3. Each colour represents one cluster, and the length of the color segment shows the strains' estimated proportion of membership in that cluster. Results shown are averages over 10 STRUCTURE runs. For clarity, strains codes are also given according to Gagneux et al. (2006). A population genetics perspective To confirm the groupings and the deep dichotomy obtained with the MIRUs, we used an independent approach, based on the ‘no-admixture’ model of the STRUCTURE program [18]. In this Bayesian approach, multilocus genotypic data are used to define a set of populations with distinct allele frequencies and assign individuals probabilistically to them, with or without prior knowledge of geographic sampling information. We applied STRUCTURE to the global data set (including the outgroup) and in ten independent runs, at K = 3 populations (Fig. 1C) STRUCTURE detected the same two deeply divergent clades 1 and 2 that were identified with the neighbour joining analysis (see Figure 1B). Notably, this separation is independently supported by the fact that TbD1 (M. tuberculosis deletion 1) is lacking in all clade 1 strains but present in all clade 2 strains, including those from EAI (Figure 1B and S2) [12]. The robustness of these clades was further evidenced by STRUCTURE analysis, because each isolate derived all of its MIRU's from only one of the three ancestral sources of clade 1, clade 2 or M. prototuberculosis (see Protocol S1). We further modelled the Bayesian assignments of the two main clades by sub-dividing them into additional clusters (Figure S3A). The bacterial isolates were consistently split into the same major clusters as those defined by the distance-based approach (see above). The highest likelihoods were obtained for K = 6 populations in each of the two main clades. Only three isolates (0.85%) were assigned to unexpected clusters by the Bayesian approach (Figure S3A), further illustrating the consistency of MIRU-VNTR cluster designations. To detect possible horizontal genetic transfer events, we used the STRUCTURE ‘linkage model’ as was done to detect ongoing genetic exchange in Helicobacter pylori [19],[20], Escherichia coli [21] and Moraxella catarrhalis [22]. Runs without prior knowledge of population source (Figure S3B) suggested that the vast majority of the MTBC strains are clonal, while some M. prototuberculosis strains might be hybrids with MTBC genotypes, in accordance with previous results [3],[5],[6]. MTBC ancestral lineages and genetic diversity To further assess the deep dichotomy, we calculated the allelic richness (the number of alleles) of the populations within the two main clades after correcting for sample size effects [23] (Fig. 2). High levels of genetic diversity are a surrogate indication of ancestral origins as illustrated in the highly divergent African human populations. The mean allelic richness per locus was close to five for both clades, and the difference was not significant (Fig. 2C), arguing for a simultaneous split of the two clades. As expected, LAM and EAI, the most basal populations in clades 1 and 2 respectively, contained the highest number of alleles (Fig. 2A, 2B). However, some uncertainty remains on a basal position for LAM because it conflicts with groupings based on internal deletions of the pks15/1 gene and on SNPs [13]. 10.1371/journal.ppat.1000160.g002 Figure 2 Genetic variability in the different MTBC lineages. (A and B) MIRU allelic richness in each population within clade 1 and 2 respectively. Rarefaction included eight isolates per population (smaller populations were not considered in this analysis. (C) Clades mean allelic richness. Notice that the difference between clade 1 and 2 is not significant (t-test, P = 0.08). Dating the disease and the evolutionary radiation steps In order to estimate the time to the most recent common ancestor (TMRCA) in the MTBC, we made use of recent analytical tools [24],[25], which make these estimations possible. They rely on Bayesian statistics and apply a stepwise mutation model (SMM) for genetic markers. This model is a reasonable assumption for MIRU mutations, as initially shown for MIRU locus 4 in the BCG evolutionary framework [9]. To test the validity of this model for the total set of the MIRU loci used, we built a minimal spanning tree of all MTBC strains based on the degree of allele sharing. We then evaluated the proportion of strains that differed from their closest relative by one step (single-locus variants- SLVs) or by multiple steps, which would violate the SMM model. This simple method will certainly overestimate any violations of the SMM model because our sampling scheme is not exhaustive, resulting in some spurious missing links (intermediate strains) that falsely invalidate the SMM model. However, the data showed that at least 64% of the allelic changes fit the stepwise mutation model, a result that is close to the 75% and 81% observed in E. coli and yeast VNTRs, respectively [26],[27]. To further evaluate the validity of the SMM model, eBURST analysis was performed on a much larger dataset comprising 1,733 MIRU-VNTR profiles from two population-based studies performed at regional and national levels (see Materials and Methods). This analysis identified 142 groups and 1061 singletons. In order to determine whether tandem repeats evolve following a SMM model and to detect a potential bias towards increase or decrease in repeat numbers, we computed within each eBURST group all differences in number of repeats along the evolutionary path, starting from the putative founder of the group to its surrounding SLVs (Figure S4). For all but two of the 24 loci, the most frequent change was either −1 or +1 repeat unit, with the symmetric change generally being the next most frequent. The only minor exceptions were loci MIRU-VNTR 3007 and 2347, which contain little information, because the only changes were one occurrence each of −2 and +1 repeat units, and four occurrences of −2 and two of +1 respectively. Both for the individual loci and for data cumulated over the 24 loci (Figure S5), the distribution of occurrences was unimodal and centered on 0 (average of −0.07±0.23, CI = 0.95, for cumulated data). At least sixty-five percent of the allelic changes matched the stepwise mutation model. It is noteworthy that missing links falsely invalidating the SMM model probably occur even in this population-based dataset, because many patients from the population studied (from the Brussels region and the entire Netherlands) were foreign-born and have probably acquired their infection abroad. Therefore, tandem repeats in M. tuberculosis most frequently evolve by progressive gain or loss of single repeat units without significant general bias towards increase or decrease. To estimate MIRU mutation rates, we used data from large sets of serial or epidemiologically-linked isolates. The probability of showing a repeat change over periods of up to 7 years was estimated to be about 1% for five of the most variable loci [11]. This corresponds to a single-locus mutation rate of 1.4×10−3 per year. Consistently, 4 of these 5 loci composed the top 4 in the hierarchy of single-locus variation frequencies measured among the MIRU loci, both in a global MTBC isolate dataset [11] and in the above population-based dataset (data not shown). This supports the use of these frequencies as a surrogate for estimating relative mutation rates of the different markers, and especially those of the less variable loci, for which repeat changes among serial or epidemiologically-linked isolates were not observed [11]. We therefore somewhat arbitrarily chose a lower mean mutation rate per year of 10−4 as a prior for the Bayesian inferences [25] over all loci, in order to accommodate the less variable loci which were associated with up to 38fold lower frequencies of single-locus variation. It is noteworthy that this initial value was well supported by posterior Bayesian analysis, as the calculated posterior mean for the mutation rate was 10−3.91 (Figure 3). By applying this mutation rate and a generation time of one day for the tuberculosis bacilli, we estimated a mean TMRCA of ≈40,000 years before present for the complex (Table 1). The TMRCAs for clades 1 and 2 were estimated as 21,000 and 33,000 years, respectively, and two of the oldest lineages, EAI and LAM coalesced at 13,700 and 7,000 years, respectively (Table 1). 10.1371/journal.ppat.1000160.g003 Figure 3 Calculated posterior mean for MIRU-VNTR mutation rate among loci using the MSVAR algorithm. This graph corresponds to the output obtained for the Haarlem population sample and the 95% interval confidence is given (red dotted lines). 10.1371/journal.ppat.1000160.t001 Table 1 Estimated Times (in years) since the most recent common ancestor (TMRCA). TMRCA Age in years CIs Hierarchic level LAM-Beijing 21,300 (14,300–31,600) Clade 1 Beijing-CAS 17,100 (11,600–25,400) Asian TB LAM-LAM 7,060 (4,370–11,100) LAM CAS-CAS 9,450 (6,100–14,700) CAS EAI-WA2 32,800 (27,900–38,300) Clade2 EAI-EAI 13,700 (9,100–21,000) EAI M. bovis-M. bovis 5,750 (4,560–7230) M. bovis EAI-LAM 41,500 (29,100–60,000) MTBC EAI-Beijing 37,500 (25,800–55,100) MTBC Estimates and 95% confidence intervals were calculated with the software YTime. In a second step, we used the MSVAR software [25] that infers past demographic changes and calculates additional parameters, including TMRCA of monophyletic populations using slightly different algorithms. For this procedure, we focused on lineages for which at least 30 isolates were included in the study, in order to avoid small sample size artefacts. The use of this method confirmed the TMRCA of the EAI population at ≈7,000 years (Figure 4B and Table 2), albeit with very wide confidence intervals (150–190,000 years). 10.1371/journal.ppat.1000160.g004 Figure 4 Detection of recent expansion in different MTBC lineages. (A) Posterior distribution of M. bovis TMRCA, including the 95% confidence interval and density plots of the marginal posterior distribution of log (N0 ), where N0 is the current effective number of chromosomes and log (Nd ), where Nd is the number of chromosomes before expansion. (B) Same plots for EAI. ta is expressed in years (±SD) and denotes the time that has elapsed since the population growth began. 10.1371/journal.ppat.1000160.t002 Table 2 Time to the most recent common ancestor (TMRCA), time elapsed since the last expansion began (ta ) and growth rate estimates based on the MSVAR software. TMRCA ta Growth lower modal upper lower modal upper modal Africa 2.024 3.510 5.085 0.418 2.193 4.604 0.60 Asia 2.126 3.620 5.022 0.833 2.006 3.774 2.16 Europe 2.257 3.701 5.029 0.710 2.321 3.527 2.12 Beijing 0.939 3.040 5.396 0.566 2.514 4.421 2.71 CAS 1.865 3.540 5.156 0.389 2.345 3.624 1.67 LAM 2.378 4.007 5.758 1.341 2.989 4.381 1.80 EAI 2.208 3.854 5.282 0.134 2.145 6.274 0.81 M. bovis 2.379 3.687 4.859 1.316 3.184 5.222 1.83 Modal values and 95% confidence intervals are presented. The results are on a log scale. M. tuberculosis demographic expansion Finally, genetic data can also unravel recent demographic change signatures in bacterial populations. By using Bayesian statistics, we tested whether a recent decline or expansion occurred in the MTBC population, and calculated ta , which reflects the time that has elapsed since the decline or expansion began. All MTBC populations from human sources that we considered displayed markedly consistent expansion rates and EAI is typical in that respect (see Figure 3B). The detected growth rates (on a log scale) ranged from a modest 0.6 value, as seen in Africa, to 2.7 for Beijing, which is probably the most successful present day lineage. This latter value translates into a recent 500-fold population size increase. The mean modal value of log10 ta was 2.25 (range 2.00–2.5) for the different populations, with the exception of the LAM lineage. This corresponds to a tuberculosis expansion that began 180 years ago (see Table 2). Discussion Taken together, the findings presented in this study indicate that the MTBC is composed of two major lineages and has emerged approximately 40,000 years ago. This estimate is strikingly close to the proposed time of dispersal of founder modern human populations from the Horn of Africa [28]. However this dating must be considered with caution in the light of the large confidence intervals. Our results support the emergence of the MTBC clone from the M. prototuberculosis progenitor pool and its co-migration with modern humans out of Africa [6]. A similar trend was recently proposed for H. pylori and M. leprae [29],[30]. We suggest that two main lineages arose later some 20,000 to 30,000 years ago from the common MTBC ancestor, one of which spread exclusively among humans, with subsequent waves of migration to Asia, Europe and continental Africa (Figure 5). This spreading scenario fits well with the current worldwide distribution of the main MTBC lineages, as reflected by the SpolDB4 database [12],[13],[14],[15],[16],[17] and LSP analysis [14],[17]. The second lineage (clade 2) arose from a human EAI-like population some 30,000 years ago and is the probable source of animal tuberculosis [12],[31], a derivation that is strongly and convergently supported by both distance-based and probabilistic methods (i.e. NJ and STRUCTURE). This conclusion is consistent with the finding that extant representatives of M. tuberculosis, which derived from the proposed progenitor of MTBC, are human pathogens [6]. Thus it is likely that humans infected their livestock and not the other way around. Clade 2 secondary branches include M. bovis and M. caprae, the infectious agents of tuberculosis in a wide variety of animals including cattle and goat, which were first domesticated in the Near East [32],[33]. The transition from human to animal hosts may thus be linked to plant and animal domestication that took place in the Fertile Crescent some 13,000 years ago. This period corresponds to the estimated time of diversification of the oldest EAI and LAM populations (Table 2). In the Fertile Crescent, and during that era of human history, small nomadic hunter-gatherer groups were replaced by farming societies based on domesticated livestock and crops [34]. This paramount event in human history was probably not without consequence for an epidemic, infectious disease such as tuberculosis, where crowded farming populations may have promoted high infection rates, bacterial spread and transition to new niches and animal hosts [35]. Clade 2 also includes M. africanum strains that primarily infect humans. However, it has recently been speculated that M. africanum may not be primarily adapted to the human host but might have originated from an unknown animal reservoir [36]. 10.1371/journal.ppat.1000160.g005 Figure 5 M. tuberculosis evolutionary scenario (out of Mesopotamia). The main migrations events are numbered and correspond to: 1, M. prototuberculosis, the ancestor of the MTBC, this bacterium reached the Fertile Crescent some 40,000 years ago by sea or land; 2 and 3, two distinct basal lineages arose, EAI and LAM and spread out of Mesopotamia some 10, 000 years ago; 4, 5 and 6, later on (8–5000 years ago) derived populations from clade 1 followed main human migration patterns to Africa, Asia and Europe, giving rise to locally adapted tubercle strains and further diversifications. Note that the depicted borders are “artificial” and are used for the demonstration. Global movements and intercontinental exchanges tend to blur this phylogenetic signal though strong enough to be detected nowadays. All MTBC populations from human sources displayed markedly constant expansion rates, corresponding to an expansion that dates back to only about 180 years. Furthermore, the largest population size increase (500-fold) was detected for Beijing, which is thought to be the most successful present day lineage. These results suggest that the expansion of the most recent form of human tuberculosis was coupled to Western urbanization and industrialization. This expansion was synchronous with the modern demographic explosion of Homo sapiens and modern intercontinental movements. Evidence for strong phylogeographical structuring of the pathogen population and preferential sympatric combinations of pathogen populations with particular ethnic groups has indicated a close association between M. tuberculosis and its human host [3],[13],[37]. Our results indicate recent parallel demographic changes between the pathogen and its host and reveal the tell-tale dynamic dimension of this association. The coalescence approach may also be useful in the future to monitor demographic changes in emerging MDR M. tuberculosis strains. Some of the conclusions presented here on the basis of MIRU data have also been reached previously, e.g. data from comparative genomics [12] after the completion of M. bovis genome [31] indicated that the MTBC did not arise as a zoonosis [38]. In contrast, the validity of efforts to date the origins of the common ancestor of MTBC by using SNP-based methods [39],[40],[41], has remained questionable [42]. Furthermore, preliminary SNP-based phylogenetic reconstructions may have been affected by hitch-hiking, and ascertainment bias [43], because those SNPs were associated with genes involved in drug-resistance [44] or were selected from a non-representative set of available genomes [14],[17],[45]. Such markers evolve too slowly for recent pathogens, as is also the case for LSPs and their use often results in uninformative phylogenies that consist of multifurcated unresolved trees [13],[44]. Unlike previous studies, the novel analyses presented here rely on globally neutral markers with mutation rates that have been estimated from human M. tuberculosis infection cases, a descent-sampling scheme and multiple, convergent population genetic estimators. As they are based on intrinsically rare and stochastic VNTR changes in clonal populations, our mutation rate estimates do involve some special assumptions. The accuracy of the demographic and temporal estimates could be improved with long-term analyses, and we are aware that the use of a mean mutation rate for all loci is suboptimal, leading to an increase of the variance of parameters. However, our estimates were consistently corroborated by posterior Bayesian calculations in independent runs over different strain populations (ranging from 10−4.19 for LAM to 10−3.82 for EAI), ruling out the risk of some local maxima. To gain further insights into the host-pathogen interactions, it would certainly be important to account for the biogeographic history and distribution of the different M. tuberculosis lineages, because recent adaptations to local host populations might play a major role [13]. Furthermore, it is known, that genetic diversity can influence the transmission dynamics of drug-resistant bacteria [2],[46], and, in terms of vaccination, it would be advisable to scrutinize independently the highly polymorphic clade 2 EAI strains that markedly differ in their genetic structure from the other human tuberculosis strains. Materials and Methods Sampling and data collection The 355 M. tuberculosis and M. prototuberculosis isolates were genotyped by multiplex PCR amplification as described previously [8],[47]. The samples were subjected to electrophoresis using ABI 3100 and 3730 automated sequencers. Sizing of the PCR fragments and assignment of the VNTR alleles of the 24 loci was done using the GeneScan and customized Genotyper, as well as the GeneMapper software packages (PE Applied Biosystems). Genetic diversity estimation The number of alleles (allelic richness) in each M. tuberculosis complex population was estimated and sample sizes were corrected by the rarefaction procedure using HP-RARE [23]. Comparison tests as well as P-values were estimated using the STATISTICA v.6.1 package. Phylogenetic inferences Nei et al.'s DA distance [48] was used to construct both isolate and population trees using a neighbour-joining algorithm as implemented in the software Populations version1.2.28. Support for the tree nodes was assessed by bootstrapping over loci (1, 000 iterations). Inferring population structure and recombination in the M. tuberculosis complex Using the no-admixture model [18] (STRUCTURE version 2), three to ten parallel Markov chains were run for all models of K with a burn-in of 100,000 iterations and a run length of 106 iterations following the burn-in. For each run, the ln likelihood of each model was calculated. The full data set was analysed for all models from K = 1 through to 3 without specifying prior information concerning the geographical sources or former designations. For K = 3, a clear splitting solution was found in which the sampled populations clustered into two main tuberculosis groups plus the outgroup (M. prototuberculosis); a result fully consistent with the neighbour-joining population tree (Figure 1B). For further analysis the data set was sub-divided into clades 1 and 2, and these were subsequently tested for K = 1 through to 6. Using the linkage model [49] of STRUCTURE version 2, ten parallel Markov chains were run for each model with a burn-in of 100,000 iterations and a run length of 106 iterations following the burn-in. For each run, M. tuberculosis strains were specified as belonging to pre-determined source clusters. We estimated the ancestry in each source cluster and the proportion of each strain genome having ancestry in each cluster. Stepwise mutation model (SMM) and mutation rate estimates To estimate the validity of SMM model, we built a minimal spanning tree of all MTBC strains based on the degree of allele sharing, by using BIONUMERICS (Applied Maths, Belgium). We then evaluated the proportion of single-locus variants (i.e. strains that differed from their closest relative) that differed by one or by multiple repeat-changes. To further evaluate the validity of the SMM model and to detect a potential bias towards increase or decrease in repeat numbers, eBURST analysis was performed on a larger dataset from two population-based studies. The first one included 807 isolates from different TB cases notified in the Brussels-Capital Region (Belgium) from September 1st, 2002 to December 31st, 2005 [50], while the second one is an ongoing study including 1907 isolates from different TB cases notified in the Netherlands over 2004 and 2005 (Van Soolingen et al., unpublished). In total, the dataset included 1,733 MIRU-VNTR profiles, with no missing data or incomplete repeats. On this dataset, the differences in the number of repeats were calculated for each pair of ancestor/descendant genotypes along the evolutionary path inferred by eBURST analysis [51]. The occurrence of each value of repeat difference was recorded for each group (defined as groups of strains with at most one allelic mismatch with at least one other member of the group), and values were pooled over all eBURST groups. This analysis was performed using software Multilocus Analyzer (S. Brisse, unpublished), which is an independent implementation (coded in Python) of the eBURST algorithm, to which the SMM test function was added. MIRU mutation rates were estimated by using data on VNTR changes among large sets of serial or epidemiologically-linked isolates [11]. Single-locus mutation rates of 5 most variable loci were estimated from corresponding frequencies of observed repeat changes. Repeat changes among serial or epidemiologically-linked isolates were not detected among the remaining, less variable loci. Therefore, the relative frequencies of single-locus variations among closely related isolates in a global MTBC isolate dataset [11] and in the population based dataset (see above) were then used as a surrogate for estimating mutation rates of less variable markers relatively to these most variable loci. Coalescence, TMRCA and demography In a first step, we used a Bayesian approach [25] that assumes a stepwise mutation model and estimates the posterior probability distributions of the genealogical and demographic parameters of a sample using Markov chain Monte Carlo simulations based on MIRU data. This method permits to extrapolate important biological parameters like the TMRCA of a given sample in years, the past and present effective population size and the latest demographic changes (decline, constant population size or expansion). In order to assess the age of the main M. tuberculosis lineages, an alternative algorithm, YTime [24] was used to calculate the TMRCAs and their confidence intervals. For the MSVAR procedure [25],[52], we focused on lineages of which at least 30 isolates were available, in order to obtain a reliable coverage of the TMRCA and to avoid small sample size artefacts. The estimated parameters were scaled in terms of current population size, and two main demographic parameters were quantified: tf , which is a measure of time in generations, was defined as ta /N0 , where ta denotes the number of generations that have elapsed since the decline or expansion began, and r, which was defined as N0 /N1 , where N0 is the current effective number of chromosomes, and N1 is the number of chromosomes at some previous point in time tf . For a declining population r 1. The procedure also estimates θ, which is defined as N0μ, where μ is the mutation rate (mutation locus−1 generation−1). The analyses were performed assuming exponential demographic change. Three different chains were run for each analysis to confirm the convergence of the results. In the analyses, rectangular priors of the log parameter values have been used. The method was found to converge appropriately for both single and multilocus data sets and supported a model of population expansion for all MTBC populations. We present only the multilocus data in the present report. Expansion signatures were robust and were confirmed in runs where decline was assumed as a prior (10−2–10−3). YTime [24]: YTime is a Matlab function which calculates the TMRCA for haplotype linked loci under the assumption of an S-SSM, which allows for unbiased +/−1 steps. YTime calculates confidence intervals using a simulation approach and is independent of the shape of the genealogy. We used all available loci (N = 24) as an input. The strains were grouped according to their lineages (obtained by phylogenetic analyses). The ancestral genotype for every subgroup was calculated as the mean of every single locus in the particular subgroup. The mutation rate was 10−4 per year per locus. For the growth rate parameters we assumed a mean effective population size of 108 for every sub-population and a growth of 103 (the mean of the results is not affected by the growth rate, just the confidence intervals). Supporting Information Protocol S1 (0.06 MB DOC) Click here for additional data file. Table S1 List of the MTBC isolates used in this study. (0.08 MB DOC) Click here for additional data file. Figure S1 Bubble-graph representation of allele frequencies for the different MIRU loci. Allele size (number of repeats) on the y-axis, and source populations on the x-axis. (2.07 MB PDF) Click here for additional data file. Figure S2 MIRU and region of deletion (RD) patterns of 176 random selected M. tuberculosis and M. prototuberculosis strains. A visualisation of MIRU and RD data was added to the rooted population neighbour-joining tree based on genetic distances (see Figure 1B). Representative results are shown for 89 isolates. The copy numbers of the 24 MIRU loci are displayed in blue shades ranging from 0 (white) to 13 (dark blue). For RD-analysis, black and white boxes correspond respectively to presence and absence of the considered region. The deletions distribution and the spolygotype patterns (data not shown) were in good congruence with the MIRU typing. Several clusters defined by MIRU typing also showed specific deletions such as RD726 for the Cameroon lineage or RD711 for West-African 1 strains. The presence or absence of the deletions also supported the dichotomy of the tree as all clade 1 strains are TBD1 negative and all clade 2 strains are TBD1 positive. However, it must be noted, that MIRU typing allowed a fin grain resolution, for example, several lineages e.g. West African 1a and West African 1b belong to two different lineages but remain undistinguishable by RD-typing. The presence or absence of the 16 Rds was determined by PCR as described previously [15],[16],[17],[18]. (1.21 MB EPS) Click here for additional data file. Figure S3 MTBC population structure. (A) Population structure of 355 M. tuberculosis and M. prototuberculosis isolates. Each strain is represented by a single vertical line divided into K colours, where K is the number of clusters assumed. Each colour represents one cluster, and the length of the coloured segment shows the strain's estimated proportion of membership in that cluster. Black lines separate the main lineages. (B) Population structure for K = 3, as in a, but with the implementation of the linkage model. (7.91 MB EPS) Click here for additional data file. Figure S4 Evolution of repeat copy number among MIRU-VNTR single-locus variants. To evaluate the validity of the stepwise mutation model and to detect a potential bias towards increase or decrease in repeat numbers, EBURST analysis was performed on a large dataset comprising a total of 2714 isolates from two population-based studies. Genotypes from a selected clonal complex are represented as circles. Stepwise and non-stepwise allelic changes between genotypes, along with corresponding marker number, are highlighted in green and gray, respectively. Insets show examples of allelic identification by analysis of marker amplicons using GENEMAPPER. Gray ladders and axis in insets define amplicon size bins expected for MIRU-VNTR alleles and measured amplicon sizes in base pairs, respectively. Code numbers in the upper left of insets define sample and marker identity, respectively. M, marker (from 1 to 24). (2.07 MB EPS) Click here for additional data file. Figure S5 Distribution of repeat copy number changes among MIRU-VNTR single-locus variants. The difference in the number of repeats was calculated for each pair of ancestor/descendant genotypes along the evolutionary path inferred by EBURST analysis, on a large dataset comprising a total of 2714 isolates from two population-based studies. The occurrence and nature of each repeat difference was recorded for each strain group (defined as groups of strains with at most one allelic mismatch with at least one other member of the group), and values were pooled over all EBURST groups. (0.96 MB EPS) Click here for additional data file.
            Bookmark
            • Record: found
            • Abstract: found
            • Article: found
            Is Open Access

            Ancient DNA from European Early Neolithic Farmers Reveals Their Near Eastern Affinities

            Introduction The transition from a hunter–gatherer existence to a “Neolithic lifestyle,” which was characterized by increasing sedentarism and the domestication of animals and plants, has profoundly altered human societies around the world [1],[2]. In Europe, archaeological and population genetic views of the spread of this event from the Near East have traditionally been divided into two contrasting positions. Most researchers have interpreted the Neolithic transition as a period of substantial demographic flux (demic diffusion) potentially involving large-scale expansions of farming populations from the Near East, which are expected to have left a detectable genetic footprint [3],[4]. The alternative view (cultural diffusion model; e.g., [5]) suggests that indigenous Mesolithic hunter–gatherer groups instead adopted new subsistence strategies with relatively little, or no, genetic influence from groups originating in the Near East. Genetic studies using mitochondrial DNA (mtDNA) and Y-chromosomal data from modern populations have generated contradictory results, and as a consequence, the extent of the Neolithic contribution to the gene pool of modern-day Europeans is still actively debated [6]–[8]. Studies that suggest that the genetic variation in modern-day Europe largely reflects farming communities of the Early Neolithic period [9]–[11] contrast strongly with others that consider the input from the Near East an event of minor importance and ascribe the European genetic variation and its distribution patterns to the initial peopling of Europe by anatomically modern humans in the Upper Paleolithic [12]–[15]. These patterns are also likely to have been significantly impacted by the early Holocene re-expansions of populations out of southerly refugia formed during the Last Glacial Maximum (∼25,000 y ago) and by the numerous demographic events that have taken place in post-Neolithic Europe. The genetics of prehistoric populations in Europe remain poorly understood, restricting real-time insights into the process of the Neolithic transition [16]–[21]. As a result, most attempts to reconstruct history have been limited to extrapolation from allele frequencies and/or coalescent ages of mitochondrial and Y chromosome haplogroups (hgs) in modern populations. Ancient DNA (aDNA) analyses now provide a powerful new means to directly investigate the genetic patterns of the early Neolithic period, although contamination of specimens with modern DNA remains a major methodical challenge [22]. A previous genetic study of 24 individuals from the early Neolithic Linear Pottery Culture (LBK; 5,500–4,900 calibrated b.c. [cal b.c.]) in Central Europe detected a high frequency of the currently rare mtDNA hg N1a, and proposed this as a characteristic genetic signature of the Early Neolithic farming population [19]. This idea was recently supported by the absence of this particular lineage (and other now more common European hgs) among sequences retrieved from neighboring Mesolithic populations [20],[21]. However, a study of 11 individuals from a Middle/Late Neolithic site on the Iberian Peninsula (3,500–3,000 cal b.c.) did not find significant differences from modern populations, supporting a quite different population genetic model for the Neolithic transition in Iberia [18]. To gain direct insight into the genetic structure of a population at the advent of farming in Central Europe we analyzed a complete graveyard from the Early Neolithic LBK site at Derenburg Meerenstieg II (Harzkreis, Saxony-Anhalt) in Germany. The archaeological culture of the LBK had its roots in the Transdanubian part of the Carpathian Basin in modern-day Hungary approximately 7,500–8,000 y ago and spread during the subsequent five centuries across a vast area ranging from the Paris Basin to the Ukraine [23],[24]. The graveyard samples provide a unique view of a local, closed population and permit comparisons with other specimens of the LBK archaeological culture (the contemporaneous meta-population) and with modern populations from the same geographical area (covering the former range of the LBK), as well as groups across the wider context of Western Eurasia. Our primary aim was to genetically characterize the LBK early farming population: by applying comprehensive phylogeographic and population genetic analyses we were able to locate its origins within the broader Eurasian region, and to trace its potential dispersal routes into Europe. Results/Discussion We used standard approaches to clone and sequence the mitochondrial hypervariable segment I (HVS-I) and applied quantitative real-time PCR (qPCR) as an additional quality control. In addition, we developed two new multiplex typing assays to simultaneously analyze important single nucleotide polymorphisms (SNPs) within the mtDNA coding region (22 SNPs: GenoCoRe22) and also the Y chromosome (25 SNPs: GenoY25). In addition to minimizing the risk of contamination, the very short DNA fragments (average 60–80 bp) required by this approach maximize the number of specimens that can be genetically typed. We successfully typed 17 individuals for mtDNA, which together with a previous study [19] provided data for 22 individuals from the Derenburg graveyard (71% of all samples collected for genetic analysis; Tables 1 and S1), and significantly extended the genetic dataset of the LBK (n = 42), to our knowledge the largest Neolithic database available. Sequences have been deposited in GenBank (http://www.ncbi.nlm.nih.gov/genbank/; accession numbers HM009339–HM009341, HM009343–HM009355, and HM009358), and detailed alignments of all HVS-I clone sequences from Derenburg are shown in Dataset S1. 10.1371/journal.pbio.1000536.t001 Table 1 Summary of archaeological, genetic, and radiocarbon data. Sample Feature Grave Age, Sexa Radiocarbon Date (Laboratory Code) (Uncalibrated BP, Cal b.c.) [73] HVS-I Sequence (np 15997–16409), Minus np 16000 Hg HVS-I Hg GenoCoRe22 Hg GenoY25 deb09 420 9 Adult, f rCRS H H deb06 421 10 Adult/mature, n.d. Ambiguous n.d. H — deb11 569 16 Adult, f? n.d. n.d. T deb10 566 17 Adult, m 093C, 224C, 311C K K — deb23 565 18 Infans I, m? 093C, 223T, 292T W W — deb12I 568 20 Infans I, m? 6,015±35 BP (KIA30400), 4,910±50 cal b.c. 298C V V — deb03 591 21 Adult, f 6,147±32 BP (KIA30401), 5,117±69 cal b.c. 147A, 172C, 223T, 248T, 320T, 355T N1a n.d. deb15 593 23 Infans I, f? 126C, 294T, 296T, 304C T2 T — deb05 604/2 29 Infans II, f?? 311C HV HVb deb22 604/3 30 Adult/mature, f 092C, 129A, 147A, 154C, 172C, 223T, 248T, 320T, 355T N1a N1 — deb20 599 31 Adult, m 6,257±40 BP (KIA30403), 5,247±45 cal b.c. 311C HV HV F*(xG,H,I,J,K) deb21 600 32 Mature, f 6,151±27 BP (KIA30404), 5,122±65 cal b.c. rCRS H H deb01 598 33 Infans II/Juvenile, f?? 147A, 172C, 223T, 248T, 355T N1a N1 deb04 596 34 Adult, m 6,141±33 BP (KIA30402), 5,112±73 cal b.c. 311C HV HVb deb26 606 37 Juvenile, m?? 069T, 126C J J — deb32 640 38 Adult/mature f 6,142±34 BP (KIA30405), 5,112±73 cal b.c. n.d. n.d. T deb30 592 40 Adult, f? 069T, 126C J J — deb29II 649 41 Adult, f? 6,068±31 BP (KIA30406), 4,982±38 cal b.c. n.d. n.d. K deb34II 484 42 Adult/mature, m 093C, 223T, 292T W W G2a3 deb33 483 43 Juvenile II, f?? 126C, 147T, 293G, 294T, 296T, 297C, 304C T2 T — deb02 644 44 Mature, f 224C, 311C K K — deb36 645 45 Mature, f 093C, 256T, 270T, 399G U5a1a U deb38 665 46 Adult/mature, m 093C, 224C, 311C K K F*(xG,H,I,J,K) deb35II 662 47 Adult, f? 126C, 189C, 294T, 296T T T deb37I 643 48 Adult/mature f 069T, 126C J J deb39 708 49 Adult/mature, f 6,148±33 BP (KIA30407), 5,117±69 cal b.c. 126C, 294T, 296T, 304C T2 T — Italicized samples had been described previously [19]. a One versus two question marks after sex indicate two levels of insecurity in sexing. b Previously analyzed diagnostic SNP sites at np 7028 AluI (hg H) and np 12308 HinfI (hg U) per restriction fragment length polymorphism. BP, before present; f, female; m, male; n.d., not determined. Multiplex SNP Typing Assays All of the mtDNA SNP typing results were concordant with the hg assignments based on HVS-I sequence information (Tables 1 and S1) and the known phylogenetic framework for the SNPs determined from modern populations [25]. The tight hierarchical structure of the latter provides a powerful internal control for contamination or erroneous results. Overall, both multiplex systems proved to be extremely time- and cost-efficient compared to the standard approach of numerous individual PCRs, and required 22–25 times less aDNA template while simultaneously reducing the chances of contamination dramatically. Also, both multiplex assays proved to be a powerful tool for analyzing highly degraded aDNA, and the GenoCoRe22 assay was able to unambiguously type four additional specimens that had failed to amplify more than 100 bp (Table 1) from two independent extractions. However, for reasons of overall data comparability, we could not include these specimens in downstream population genetic analyses, which required HVS-I sequence data. The only artifacts detected were occasional peaks in the electropherograms of the SNaPshot reactions outside the bin range of expected signals. These were probably due to primers and were mainly present in reactions from extracts with very little or no DNA template molecules; they were not observed with better preserved samples or modern controls. In contrast, Y chromosome SNPs could be typed for only three out of the eight male individuals (37.5%; Table S2) identified through physical anthropological examination, reflecting the much lower copy number of nuclear loci [22]. After typing with the GenoY25 assay, individual deb34 was found to belong to hg G (M201), whereas individuals deb20 and deb38 both fall basally on the F branch (derived for M89 but ancestral for markers M201, M170, M304, and M9), i.e., they could be either F or H (Table 1). To further investigate the hg status beyond the standard GenoY25 assay, we amplified short fragments around SNP sites M285, P287, and S126 to further resolve deb34 into G1, G2*, and G2a3, and around SNP site M69 to distinguish between F and H [26]. deb34 proved to be ancestral for G1-M285 but derived for G2*-P287 and additional downstream SNP S126 (L30), placing it into G2a3. deb20 and deb38 were shown to be ancestral at M69 and hence basal F (M89), and remained in this position because we did not carry out further internal subtyping within the F clade. The multiplexed single base extension (SBE) approach with its shortened flanking regions around targeted SNPs significantly increases the chance of successful Y-chromosomal amplifications, which have remained problematic for aDNA studies, as have nuclear loci in general, because of the much lower cellular copy number compared to mitochondrial loci. The multiplexed SBE approach promises to open the way to studying the paternal history of past populations, which is of paramount importance in determining how the social organization of prehistoric societies impacted the population dynamics of the past. Quantitative Real-Time PCR Results of the qPCR revealed significantly (p = 0.012, Wilcoxon signed-ranks test) more mtDNA copies per microliter of each extract for the shorter fragment (141 bp) than for the longer (179 bp), with an average 3.7×104–fold increase (detailed results are shown in Table S3). This finding is consistent with previous observations demonstrating a biased size distribution for authentic aDNA molecules [22],[27],[28] and suggests that any contaminating molecules, which would also result in higher copy numbers in the larger size class, did not significantly contribute to our amplifications. Population Genetic Analyses To analyze the Neolithic mtDNA sequence diversity and characterize modern geographical affinities, we applied a range of population genetic analyses including shared haplotype analyses, principal component analyses (PCAs), multidimensional scaling (MDS), geographic mapping of genetic distances, and demographic modeling via Bayesian Serial Simcoal (BayeSSC) analyses (Table 2). 10.1371/journal.pbio.1000536.t002 Table 2 Summary statistics, overview of population genetic analyses, and summary of haplogroup frequencies used for comparison with PCA vector loadings. Category Variable, Simulation, or Hg Modern Datasets Ancient Datasetsa Total Dataset Pooled Geographic Sets of Equal Size (n = ∼500) Pooled European Dataset Pooled Near East Populations DEB22 LBK20 LBK42 LBK34 Hunter–Gatherers Summary statistics Populations 55 37 41 14 1 1 1 1 1 Samples 23,394 18,039 22 20 42 34 20 Population genetic analysis & simulations Shared haplotypes X X PCA X X X X X X Relative hg frequencies X X X X X X X MDS X X X X X Genetic distance maps X X X BayeSSC Xb Xb X X Haplotype diversity h 0.957 0.989 0.969 0.982 0.932 Tajima's D −0.91645 −0.90573 −0.91374 −0.88555 −1.05761 Relative hg frequencies Asian hgs 1.62 2.09 0.00 0.00 0.00 0.00 0.00 African hgs 0.65 6.43 0.00 0.00 0.00 0.00 0.00 R0/preHV 0.37 3.26 0.00 0.00 0.00 0.00 0.00 H 43.35 23.74 13.64 25.00 19.05 17.65 0.00 HV 1.40 5.80 13.64 0.00 7.14 2.94 0.00 J 8.49 10.59 13.64 5.00 9.52 5.88 4.76 T 9.26 8.91 13.64 25.00 19.05 23.53 9.52 I 2.23 1.97 0.00 0.00 0.00 0.00 0.00 N1a 0.30 0.32 13.64 15.00 14.29 17.65 0.00 K 5.39 6.67 13.64 15.00 14.29 14.71 4.76 V 4.35 0.77 4.55 5.00 4.76 5.88 0.00 W 2.03 2.25 9.09 5.00 7.14 5.88 0.00 X 1.22 2.52 0.00 0.00 0.00 0.00 0.00 U2 1.04 1.52 0.00 0.00 0.00 0.00 0.00 U3 1.26 4.43 0.00 5.00 2.38 2.94 0.00 U4 4.04 2.10 0.00 0.00 0.00 0.00 9.52 U5a 5.46 2.53 4.55 0.00 2.38 2.94 23.80 U5b 3.89 0.64 0.00 0.00 0.00 0.00 28.57 Other rare hgs 3.67 13.45 0.00 0.00 0.00 0.00 19.05 X's indicate which datasets were used in the genetic analyses. a For explanation of datasets, see Materials and Methods. b For BayeSSC analyses, representative samples of the key areas were randomly drawn from the larger meta-population pool (Table S6). Shared Haplotype Analyses We prepared standardized modern population datasets of equal size (n = ∼500) from 36 geographical regions in Eurasia (n = 18,039; Table S4) to search for identical matches with each LBK haplotype. Out of 25 different haplotypes present in 42 LBK samples, 11 are found at high frequency in nearly all present-day populations under study, a further ten have limited geographic distribution, and the remaining four haplotypes are unique to Neolithic LBK populations (Table S4). The 11 widespread haplotypes are mainly basal (i.e., constituting a basal node within the corresponding hg) for Western Eurasian mitochondrial hgs H, HV, V, K, T, and W. While these haplotypes are relatively uninformative for identifying genetic affiliations to extant populations, this finding is consistent within an ancient population (5,500–4,900 cal b.c., i.e., prior to recent population expansions), in which basal haplotypes might be expected to be more frequent than derived haplotypes (e.g., end tips of branches within hgs). The next ten LBK haplotypes were unequally spread among present-day populations and for this reason potentially contain information about geographical affinities. We found nine modern-day population pools in which the percentage of these haplotypes is significantly higher than in other population pools (p>0.01, two-tailed z test; Figure 1; Table S4): (a) North and Central English, (b) Croatians and Slovenians, (c) Czechs and Slovaks, (d) Hungarians and Romanians, (e) Turkish, Kurds, and Armenians, (f) Iraqis, Syrians, Palestinians, and Cypriotes, (g) Caucasus (Ossetians and Georgians), (h) Southern Russians, and (i) Iranians. Three of these pools (b–d) originate near the proposed geographic center of the earliest LBK in Central Europe and presumably represent a genetic legacy from the Neolithic. However, the other matching population pools are from Near East regions (except [a] and [h]), which is consistent with this area representing the origin of the European Neolithic, an idea that is further supported by Iranians sharing the highest number of informative haplotypes with the LBK (7.2%; Table S4). The remaining pool (a) from North and Central England shares an elevated frequency of mtDNA T2 haplotypes with the LBK, but otherwise appears inconsistent with the proposed origin of the Neolithic in the Near East. It has been shown that certain alleles (here hgs) can accumulate in frequency while surfing on the wave of expansion, eventually resulting in higher frequencies relative to the proposed origin [29],[30]. Several of the other population pools also show a low but nonsignificant level of matches, which may relate to pre-Neolithic distributions or subsequent demographic movements (Figure 1). 10.1371/journal.pbio.1000536.g001 Figure 1 Percentages of shared haplotype matches per population. Populations are plotted on a northwest–southeast axis. Note that the percentage of non-informative matches (orange) is nearly identical to the percentage of all shared haplotypes (red) in most populations, whereas we observe elevated frequencies of informative matches (blue) in Southeast European and Near Eastern population pools, culminating in Iranians. Of the four unique mtDNA haplotypes, two were from an earlier study of the LBK (16286-16304 and 16319-16343; Table S5 and [19]). The haplotype 16286-16304 has many one- or two-step derivates in all parts of Europe and is therefore rather uninformative for inferring further geographical affinities. The only relatively close neighbor of haplotype 16319-16343 is found in Iraq (16129-16189-16319-16343), in agreement with the Near Eastern affinities of the informative LBK haplotypes. The other two unique LBK haplotypes belong to N1a, the characteristic LBK hg. The frequency of N1a was 13.6% for Derenburg samples (3/22) and 14.3% for all LBK samples published to date (6/42). Notably, N1a has not yet been observed in the neighboring hunter–gatherer populations of Central Europe before, during, or after the Early Neolithic [20] nor in the early Neolithic Cardial Ware Culture from Spain [18]. The Y chromosome hgs obtained from the three Derenburg early Neolithic individuals are generally concordant with the mtDNA data (Table 1). Interestingly, we do not find the most common Y chromosome hgs in modern Europe (e.g., R1b, R1a, I, and E1b1), which parallels the low frequency of the very common modern European mtDNA hg H (now at 20%–50% across Western Eurasia) in the Neolithic samples. Also, while both Neolithic Y chromosome hgs G2a3 and F* are rather rare in modern-day Europe, they have slightly higher frequencies in populations of the Near East, and the highest frequency of hg G2a is seen in the Caucasus today [15]. The few published ancient Y chromosome results from Central Europe come from late Neolithic sites and were exclusively hg R1a [31]. While speculative, we suggest this supports the idea that R1a may have spread with late Neolithic cultures from the east [31]. Principal Component Analysis and Multidimensional Scaling Four Neolithic datasets were constructed (Table 2) and compared with 55 present-day European and Near Eastern populations and one Mesolithic hunter–gatherer population [20] in a PCA (Figure 2). The PCA accounted for 39% of the total genetic variation, with the first principal component (PC) separating Near Eastern populations from Europeans (24.9%), and with LBK populations falling closer to Near Eastern ones. However, the second PC (17.4%) clearly distinguished the four Neolithic datasets from both Near East and European populations. An MDS plot (Figure S1) showed similar results, with the Near Eastern affinities of the LBK populations even more apparent. 10.1371/journal.pbio.1000536.g002 Figure 2 PCA plot based on mtDNA haplogroup frequencies. The two dimensions display 39% of the total variance. The contribution of each hg is superimposed as grey component loading vectors. Notably, the Derenburg dataset (DEB22) groups well with its meta-population (LBK20), supporting the unique status and characteristic composition of the LBK sample. Populations are abbreviated as follows (Table S6): ALB, Albanians; ARM, Armenians; ARO, Aromuns; AUT, Austrians; AZE, Azeris; BAS, Basques; BLR, Byelorussians; BOS, Bosnians; BUL, Bulgarians; CHE, Swiss; CHM, Mari; CHV, Chuvash; CRO, Croats; CZE, Czechs; DEB22, Derenburg; DEU, Germans; ENG, English; ESP, Spanish; EST, Estonians; FIN, Finns; FRA, French; GEO, Georgians; GRC, Greeks; HG, European Mesolithic hunter–gatherers.; HUN, Hungarians; IRL, Irish; IRN, Iranians; IRQ, Iraqis; ISL, Icelanders; ITA, Italians; JOR, Jordanians; KAB, Kabardinians; KAR, Karelians; KOM, Komis (Permyaks and Zyrian); KUR, Kurds; LBK20, LBK without Derenburg; LBK34, all LBK samples excluding potential relatives; LBK42, all LBK; LTU, Lithuanians; LVA, Latvians; MAR, Moroccans; MOR, Mordvinians; NOG, Nogais; NOR, Norwegians; OSS, Ossetians; POL, Poles; PRT, Portuguese; PSE, Palestinians; ROU, Romanians; RUS, Russians; SAR, Sardinians; SAU, Saudi Arabians; SCO, Scots; SIC, Sicilians; SVK, Slovaks; SVN, Slovenians; SWE, Swedes; SYR, Syrians; TAT, Tatars; TUR, Turkish; UKR, Ukrainians. To better understand which particular hgs made the Neolithic populations appear either Near Eastern or (West) European, we compared average hg frequencies of the total LBK (LBK42) and Derenburg (DEB22) datasets to two geographically pooled meta-population sets from Europe and the Near East (Tables 2 and S6; 41 and 14 populations, respectively). PC correlates and component loadings (Figure 2) showed a pattern similar to average hg frequencies (Table 2) in both large meta-population sets, with the LBK dataset grouping with Europeans because of a lack of mitochondrial African hgs (L and M1) and preHV, and elevated frequencies of hg V. In contrast, low frequencies of hg H and higher frequencies for HV, J, and U3 promoted Near Eastern resemblances. Removal of individuals with shared haplotypes within the Derenburg dataset (yielding dataset LBK34) did not noticeably decrease the elevated frequencies of J and especially HV in the Neolithic data. Most importantly, PC correlates of the second component showed that elevated or high frequencies of hgs T, N1a, K, and W were unique to LBK populations, making them appear different from both Europe and Near East. The considerable within-hg diversity of all four of these hgs (especially T and N1a; Table 1) suggests that this observation is unlikely to be an artifact of random genetic drift leading to elevated frequencies in small, isolated populations. The pooled European and Near Eastern meta-populations are necessarily overgeneralizations, and there are likely to be subsets of Near Eastern populations that are more similar to the Neolithic population. Interestingly, both the PCA and the MDS plots identified Georgians, Ossetians, and Armenians as candidate populations (Figures 2 and S1). Mapping Genetic Distances We generated genetic distance maps to visualize the similarity/distance of the LBK and Derenburg populations (datasets LBK42 and DEB22) to all modern populations in the large Western Eurasian dataset (Figure 3). In agreement with the PCA and MDS analyses, populations from the area bounding modern-day Turkey, Armenia, Iraq, and Iran demonstrated a clear genetic similarity with the LBK population (Figure 3A). This relationship was even stronger in a second map generated with just the Neolithic Derenburg individuals (Figure 3B). Interestingly, the map of the combined LBK data also suggested a possible geographic route for the dispersal of Neolithic lineages into Central Europe: genetic distances gradually increase from eastern Anatolia westward across the Balkans, and then northwards into Central Europe. The area with lower genetic distances follows the course of the rivers Danube and Dniester, and this natural corridor has been widely accepted as the most likely inland route towards the Carpathian basin as well as the fertile Loess plains further northwest [23],[32],[33]. 10.1371/journal.pbio.1000536.g003 Figure 3 Genetic matrilineal distances between 55 modern Western Eurasian populations (Table S6) and Neolithic LBK samples. Mapped genetic distances are illustrated between 55 modern Western Eurasian populations and the total of 42 Neolithic LBK samples (A) or the single graveyard of Derenburg (B). Black dots denote the location of modern-day populations used in the analysis. The coloring indicates the degree of similarity of the modern local population(s) with the Neolithic sample set: short distances (greatest similarity) are marked by dark green and long distances (greatest dissimilarity) by orange, with fainter colors in between the extremes. Note that green intervals are scaled by genetic distance values of 0.02, with increasingly larger intervals towards the “orange” end of the scale. Bayesian Serial Simcoal Analysis While an apparent affinity of Neolithic farmers to modern-day Near East populations is revealed by the shared haplotype analyses, PCA, MDS, and genetic distance maps, the population-specific pairwise F ST values among ancient populations (hunter–gatherers and LBK) and the modern population pools (Central Europe and Near East) tested were all significant (p>0.05; Table 3), suggesting a degree of genetic discontinuity between ancient and modern-day populations. The early farmers were closer to the modern Near Eastern pool (F ST = 0.03019) than hunter–gatherers were (F ST = 0.04192), while both ancient populations showed similar differences to modern Central Europe, with the hunter–gatherers slightly closer (F ST = 0.03445) than the early farmers (F ST = 0.03958). The most striking difference was seen between Mesolithic hunter–gatherers and the LBK population itself (F ST = 0.09298), as previously shown [20]. We used BayeSSC analyses to test whether the observed F ST values can be explained by the effects of drift or migration under different demographic scenarios (Figure S2). This encompassed comparing F ST values derived from coalescent simulations under a series of demographic models with the observed F ST values in order to test which model was the most likely, given the data. By using an approximate Bayesian computation (ABC) framework we were able to explore priors for initial starting deme sizes and dependent growth rates to maximize the credibility of the final results. The Akaike information criterion (AIC) was used to evaluate a goodness-of-fit value of the range of models in the light of the observed F ST values. In addition, a relative likelihood estimate for each of the six models given the data was calculated via Akaike weights (ω). The highest AIC values, and therefore the poorest fit, were obtained for models representing population continuity in one large Eurasian meta-population through time (Models H0a and H0b; Table 4). Of note, the goodness of fit was better with a more recent population expansion (modeled at the onset of the Neolithic in Central Europe) and hence higher exponential growth rate (H0a). The model of cultural transmission (H1), in which a Central European deme including Neolithic farmers and hunter–gatherers coalesced with a Near Eastern deme in the Early Upper Paleolithic (1,500 generations, or ∼37,500 y ago), resulted in intermediate goodness-of-fit values (H1a and H1b; Table 4; Figure S2). The best goodness-of-fit values were retrieved for models of demic diffusion (model H2; Table 4) with differing proportions of migrants (25%, 50%, and 75% were tested) from the Near Eastern deme into the Central European deme around the time of the LBK (290 generations, ∼7,250 y ago; Table 4). Notably, the models testing 50% and 75% migrants returned the highest relative likelihood values (42% and 52%, respectively), and therefore warrant further investigation. However, while the demic diffusion model H2 produced values that approximated the observed F ST between Neolithic farmers and the Near Eastern population pool, none of the models could account for the high F ST between hunter–gatherers and early farmers or early farmers and modern-day Central Europeans. 10.1371/journal.pbio.1000536.t003 Table 3 Pairwise F ST values between ancient and modern-day population pools as used for goodness-of-fit estimates in BayeSSC analyses. Hunter–Gatherers Near East LBK Central Europe Hunter–Gatherers 0 — — — Near East 0.04192 0 — — LBK 0.09298 0.03019 0 — Central Europe 0.03445 0.00939 0.03958 0 10.1371/journal.pbio.1000536.t004 Table 4 Details of the demographic models analyzed with BayeSSC and AIC goodness-of-fit estimates, and resulting model probabilities via Akaike weights. Model H0a H0b H1 H2 H2 H2 Prior N e, time 0 ,deme 0 Ua:100000,30000000 U:100000,30000000 U:100000,12000000 U:100000,12000000 U:100000,12000000 U:100000,12000000 Prior N e, time 0, deme 1 U:100000,12000000 U:100000,12000000 U:100000,12000000 U:100000,12000000 Percent migrants from deme 0 to deme 1 25% 50% 75% AIC 97.78 120.37 89.19 82.56 78.52 78.07 Akaike weight ω 2.76164e−5 3.42478e−10 0.002018032 0.055596369 0.418527622 0.52383036 Of note, the smaller the AIC value, the better the fit of the model. While no threshold value can be assigned to AIC values at which any model can be rejected, the Akaike weights estimate a model probability given the six models tested. a U, uniform distribution of given range. N e, effective population size. The models we tested represent major oversimplifications and it should be noted that modeling human demographic history is notoriously difficult, especially given the complex history of Europe and the Near East over this time scale. The fact that no model explained the observed F ST between ancient and modern-day populations particularly well suggests that the correct scenario has not yet been identified, and that there is also an obvious need for sampling of material from younger epochs. Additionally, sampling bias remains an issue in aDNA studies, and this is particularly true for the chronologically and geographically diverse hunter–gatherer dataset. In the light of the models tested (see also [19],[20]), we would suggest that the basis of modern European mtDNA diversity was formed from the postglacial re-peopling of Europe (represented here by the Mesolithic hunter–gatherers) and the genetic input from the Near East during the Neolithic, but that demographic processes after the early Neolithic have contributed substantially to shaping Europe's contemporary genetic make up. Synthesis of Population Genetic Analyses The aDNA data from a range of Mesolithic hunter–gatherer samples from regions neighboring the LBK area have been shown to be surprisingly homogenous across space and time, with an mtDNA composition almost exclusively of hg U (∼80%), particularly hg U4 and U5, which is clearly different from the LBK dataset as well as the modern European diversity (Table 2) [20]. The observation that hgs U4 and U5 are virtually absent in the LBK population (1/42 samples) is striking (Table 2). Given this clear difference in the mtDNA hg composition, it is not surprising that the pairwise F ST between hunter–gatherers and the LBK population is the highest observed (0.09298) when we compared ancient populations with representative population pools from Central Europe and the Near East (Table 3; see also [20]). If the Mesolithic data are a genuine proxy for populations in Central Europe at the onset of the LBK, it implies that the Mesolithic and LBK groups had clearly different origins, with the former potentially representing the pre-Neolithic indigenous groups who survived the Last Glacial Maximum in southern European refugia. In contrast, our population genetic analyses confirm that the LBK shares an affinity with modern-day Near East and Anatolia populations. Furthermore, the large number of basal lineages within the LBK, a reasonably high hg and haplotype diversity generated through one- or two-step derivative lineages, and the negative Tajima's D values (Tables 1 and 2) indicate a recent expansion. These combined data are compatible with a model of Central Europe in the early Neolithic of indigenous populations plus significant inputs from expanding populations in the Near East [4],[12],[34]. Overall, the mtDNA hg composition of the LBK would suggest that the input of Neolithic farming cultures (LBK) to modern European genetic variation was much higher than that of Mesolithic populations, although it is important to note that the unique characteristics of the LBK sample imply that further significant genetic changes took place in Europe after the early Neolithic. aDNA data offers a powerful new means to test evolutionary models and assumptions. The European lineage with the oldest coalescent age, U5, has indeed been found to prevail in the indigenous hunter–gatherers [12],[35]. However, mtDNA hgs J2a1a and T1, which because of their younger coalescence ages have been suggested to be Neolithic immigrant lineages [8],[12], are so far absent from the samples of early farmers in Central Europe. Similarly, older coalescence ages were used to support hgs K, T2, H, and V as “postglacial/Mesolithic lineages,” and yet these have been revealed to be common only in Neolithic samples. The recent use of whole mitochondrial genomes and the refinement of mutation rate estimates have resulted in a general reduction in coalescence ages [8], which would lead to an improved fit with the aDNA data. However we advise caution in directly relating coalescence ages of specific hgs to evolutionary or prehistoric demographic events [36]. Significant temporal offsets can be caused by either observational bias (the delay between the actual split of a lineage and the eventual fixation and dissemination of this lineage) or calculation bias (incorrect coalescent age estimation). aDNA has considerable value not only for directly analyzing the presence or absence of lineages at points in the past but also for refining mutation rate estimates by providing internal calibration points [37]. Archaeological and anthropological research has produced a variety of models for the dispersal of the Neolithic agricultural system (“process of Neolithization”) into and throughout Europe (e.g., [1],[2],[38]). Our findings are consistent with models that argue that the cultural connection of the LBK to its proposed origin in modern-day Hungary, and reaching beyond the Carpathian Basin [23],[32],[38],[39], should also be reflected in a genetic relationship (e.g., shared haplotype analyses; Table S4). Therefore at a large scale, a demic diffusion model of genetic input from the Near East into Central Europe is the best match for our observations. It is notable that recent anthropological research has come to similar conclusions [40],[41]. On a regional scale, “leap-frog” or “individual pioneer” colonization models, where early farmers initially target the economically favorable Loess plains in Central Europe [33],[42], would explain both the relative speed of the LBK expansion and the clear genetic Near Eastern connections still seen in these pioneer settlements, although the resolving power of the genetic data is currently unable to test the subtleties of these models. In conclusion, the new LBK dataset provides the most detailed and direct genetic portrait of the Neolithic transition in Central Europe; analysis of this dataset reveals a clear demonstration of Near Eastern and Anatolian affinities and argues for a much higher genetic input from these regions, while also identifying characteristic differences from all extant (meta-)populations studied. Ancient genetic data from adjacent geographic regions and time periods, and especially from the Near East and Anatolia, will be needed to more accurately describe the changing genetic landscape during and after the Neolithic, and the new multiplexed SBE assays offer a powerful means to access this information. Materials and Methods Archaeological Background The archaeological site Derenburg Meerenstieg II (Harzkreis, Saxony-Anhalt, Germany) was excavated during three campaigns in 1997–1999 comprising an area of 3 ha. The archaeological context at this site shows a record of settlement activity ranging from the Early Neolithic (LBK) and Middle Neolithic (Rössen and Ammensleben cultures) to Bronze and Iron Age [43]. However, the main features of Derenburg are the LBK graveyard and its associated partial settlement approximately 70 m southwest. The archaeological data revealed that the larger part of the settlement has not yet been excavated and lies outside the area covered during these campaigns. In contrast, the graveyard was recorded in its entire dimension (25×30 m) and encompassed a total of 41 graves. Two separate graves were found outside the graveyard (50 m WSW and 95 m SSE). Erosion and modern agricultural ploughing might have led to a loss of some graves at the plateau area. Here, the graves were shallow and in average state of preservation, whereas the graves embedded in deeper Loess layers showed an excellent state of preservation. In total, 32 single grave burials were found; there were also one double burial, one triple burial, two burials in settlement pits, two or three times additional singular bones in a grave, three burials with a secondary inhumation, and one empty grave. The majority of individuals (75%) at Derenburg were buried in East–West orientation in a varying flexed position. The duration of usage of the graveyard spans over the entire time frame of the LBK and is reflected by the typology of the ceramics and associated grave goods ranging from older LBK pottery (Flomborn style) to youngest LBK pottery. Absolute radiocarbon dates confirm the usage over three centuries (5,200–4,900 cal b.c.; see also Table 1 and [44]). Ancient DNA Work From an initial 43 graves in the Derenburg graveyard, 31 indicated morphological preservation suitable for sampling and aDNA analyses. Five individuals had already been sampled in 2003 for our previous study and showed excellent preservation of aDNA, a negligible level of contamination, and an unusual mtDNA hg distribution, thereby justifying further investigation [19]. Hence, 26 additional individuals were processed in this study (Table 1). We amplified, cloned, and sequenced mitochondrial HVS-I (nucleotide positions [np] 15997–16409; nucleotide position according to [45]) as described previously [19]. mtDNA hg assignments were further supported by typing with a newly developed multiplex of 22 mtDNA coding region SNPs (GenoCoRe22). In addition, we typed 25 Y chromosome SNPs using a second novel multiplex assay (GenoY25). Final refinement of Y chromosome hg assignments was performed via singleplex PCRs. Lastly, the amount of starting DNA template molecules was monitored using qPCR on seven random samples (Table S3). aDNA work was performed in specialized aDNA facilities at the Johannes Gutenberg University of Mainz and the Australian Centre for Ancient DNA (ACAD) at the University of Adelaide according to appropriate criteria. All DNA extractions as well as amplification, cloning, and sequencing of the mitochondrial control region HVS-I were carried out in the Johannes Gutenberg University of Mainz facilities. Additional singleplex, all multiplex, and quantitative real-time amplifications, SNP typing, and direct sequencing of Y chromosome SNPs were carried at the ACAD as described below. SNP Selection and Multiplex Design The technique of SNP typing via SBE reactions (also known as minisequencing) has proven a reliable and robust method for high throughput analyses of polymorphisms, e.g., human mitochondrial variation [46], human X- and Y-chromosomal SNPs [47],[48], and human autosomal SNPs [49]. However, few SBE studies have addressed the special need for very short amplicon sizes to allow amplification from highly degraded DNA, as even forensic protocols have generally targeted relatively long amplicon sizes [50]–[54]. Our first multiplex (GenoCoRe22) was designed to type a panel of 22 mitochondrial coding region SNPs that are routinely typed within the Genographic Project [25], to allow for future maximum comparability with modern population data. A second multiplex (GenoY25) targeted a basal, but global, coverage of 25 commonly typed Y chromosome SNPs, for maximum comparability of paternal lineages. The aim of the SNP assay design was to produce highly efficient and sensitive protocols, capable of working on highly degraded DNA, that also allow modern human DNA contamination to be detected at very low levels and monitored [51]. The GenoCoRe22 SNP panel was chosen to cover the basal branches of mitochondrial hgs across modern human mtDNA diversity [25]. The chosen SNP sites were identical to the initial set (Figure 4 in [25]) except for hg W (SNP at np 8994 instead of np 1243) and hg R9 (SNP at np 13928 instead of np 3970), as a compromise arising from primer design within a multiplex assay. Selection of GenoY25 SNP panel for incorporation into the multiplex assay was performed using the highly resolved Y Chromosome Consortium tree and an extensive literature search for corresponding SNP allele frequencies in European populations [13],[26],[55]. Multiplex PCR Assays GenoCoRe22 and GenoY25 Multiplex assays were set up, established, and performed at the ACAD facilities. Multiplex PCR using Amplitaq Gold (Applied Biosystems) was conducted in 25-µl volumes using 1× Buffer Gold, 6 mM (GenoCoRe22) or 8 mM (GenoY25) MgCl2, 0.5 mM dNTPs (Invitrogen), ≤0.2 µM of each primer, 1 mg/ml RSA (Sigma), 2 U of Amplitaq Gold Polymerase, and 2 µl of DNA extract. Thermocycling conditions consisted of an initial enzyme activation at 95°C for 6 min, followed by 40–45 cycles of denaturation at 95°C for 30 s, annealing at 60°C (GenoCoRe22) or 59°C (GenoY25) for 30 s, and elongation at 65°C for 30 s, with a single final extension time at 65°C for 10 min. Each PCR included extraction blanks as well as a minimum of two PCR negatives at a ratio of 5∶1. PCRs were visually checked by electrophoresis on 3.5% agarose TBE gels. PCR products were purified by mixing 5 µl of PCR product with 1 U of SAP and 0.8 U of ExoI and incubating at 37°C for 40 min, followed by heat inactivation at 80°C for 10 min. Because of the sensitivity of the multiplex PCR (using fragment lengths of only 60–85 bp), and to be able to monitor potential human background contamination, usually all controls were included in downstream fragment analysis. Multiplex primer sequences and concentration are given in Table S7. SNaPshot Typing SBE reactions were carried out on the GenoCoRe22 and GenoY25 SNP multiplex assay using the ABI Prism SNaPshot multiplex reaction kit (Applied Biosystems) following the manufacturer's instructions, except that 10% 3 M ammonium sulfate was added to the extension primer mix to minimize artifacts [56]. SBE primers and concentrations are given in Table S7. Cycling conditions consisted of 35 cycles of denaturation at 96°C for 10 s, annealing at 55°C for 5 s, and extension at 60°C for 30 s. SBE reactions were purified using 1 U of SAP, incubating at 37°C for 40 min, followed by heat inactivation at 80°C for 10 min. Prior to capillary electrophoresis, 2 µl of purified SNaPshot product was added to a mix of 11.5 µl of Hi-Di Formamide (Applied Biosystems) and 0.5 µl of Gene-Scan-120 LIZ size standard (Applied Biosystems). Samples were run on an ABI PRISM 3130xl Genetic Analyzer (Applied Biosystems) after a denaturation carried out according to the manufacturer's instructions using POP-6 (Applied Biosystems). Evaluation and analyses of SNaPshot typing profiles were performed using custom settings within the GeneMapper version 3.2 Software (Applied Biosystems). Y Chromosome SNP Singleplex PCRs and Sequencing Additional Y chromosome SNPs (M285, P287 S126, and M69) were tested to determine specific downstream subclades based on the initial multiplex results in order to gain further resolution. We chose appropriate SNP loci by following general criteria, trying to keep the PCR amplicon size smaller than 90 bp in size and flanking DNA sequences free from interfering polymorphisms, such as nucleotide substitutions in potential primer binding sites. We selected PCR amplification primers that have a theoretical melting temperature of around 60°C in neutral buffered solutions (pH 7–8), with monovalent cation (Na+) concentrations at 50 mM and divalent cation (Mg++) concentrations at 8 mM. All primer candidates were analyzed for primer–dimer formation, hairpin structures, and complementarities to other primers in the multiplex using Primer 3 (http://primer3.sourceforge.net/). Primer characteristics were chosen to ensure equal PCR amplification efficiency for all DNA fragments, as previously described [50]. The primers were HPLC-purified and checked for homogeneity by MALDI-TOF (Thermo). Table S7 shows the sequences and the concentrations of the amplification primers in the final multiplex PCR. Additional Y chromosome SNP singleplex PCRs were carried out in the ACAD facilities. Standard PCRs using Amplitaq Gold (Applied Biosystems) were conducted in 25-µl volumes using 1× Buffer Gold, 2.5 mM MgCl2, 0.25 mM of each dNTP (Fermentas), 400 µM of each primer (Table S7), 1 mg/ml RSA (Sigma-Aldrich), 2 U of Amplitaq Gold Polymerase, and 2 µl of DNA extract. Thermocycling conditions consisted of an initial enzyme activation at 95°C for 6 min, followed by 50 cycles of denaturation at 94°C for 30 s, annealing at 59°C for 30 s, and elongation at 72°C for 30 s, with a single final extension time at 60°C for 10 min. Each PCR reaction included extraction blanks as well as a minimum of two PCR negatives. PCR products were visualized and purified as described above and were directly sequenced in both directions using the Big Dye Terminator 3.1 Kit (Applied Biosystems) as per manufacturer's instructions. Sequencing products were purified using Cleanseq magnetic beads (Agencourt, Beckman Coulter) according to the manufacturer's protocol. Sequencing products were separated on a 3130xl Genetic Analyzer (Applied Biosystems), and the resulting sequences were edited and aligned relative to the SNP reference sequence (GenBank SNP accession numbers: M285, rs13447378; P287, rs4116820; S126 [also known as L30], rs34134567; and M69, rs2032673) using the software Sequencher 4.7 (Genecodes). Quantitative Real-Time PCR qPCR was used to determine the amount of DNA in the samples prior to amplification and to assess the authenticity based on the assumption that there is an inverse relationship between DNA quantity and fragment length for degraded aDNA [57],[58]. Two different length fragments were amplified from the HVS-I: 141 bp (L16117/H16218) and 179 bp (L16209/H16348) [19],[59]. All qPCR reactions were carried out in a 10-µl reaction volume containing 1× Express SYBR Green ER Supermix Universal (Invitrogen), rabbit serum albumin (10 mg/ml), forward and reverse primers (10 µM), and 1 µl of DNA extract. Thermocycling conditions consisted of an initial enzyme activation at 95°C for 5 min, followed by 50 cycles of 94°C for 10 s, 58°C for 20 s, and 72°C for 15 s. The primer specificity was assessed using a post-PCR melt curve to visualize the dissociation kinetics. The primers were validated using modern DNA, and a single peak was observed for both fragments, indicating specific binding. The dissociation temperature (T M) was 80–80.3°C for the 141-bp fragment and 81.7–82.3°C for the 179-bp fragment. Both primer pairs showed an absence of primer dimers, indicated by the lack of a smaller peak on the melt curve (≈60°C) and a single band on a 2% agarose gel. The starting quantity of DNA in the ancient samples was determined by comparison to a standard curve of a known amount of DNA. The standard curves for the two fragments were created from modern human DNA. The DNA was extracted from a buccal cheek swab of a single individual using DNeasy Blood and Tissue Kit (Qiagen). mtDNA was amplified for the two fragments (141 bp and 179 bp) using 1× Hotmaster Buffer (Eppendorf), 0.5 U of Hotmaster Taq (5Prime), forward and reverse primers (10 µM), distilled water, and 2 µl of DNA extract. Thermocycling conditions consisted of an initial enzyme activation at 94°C for 2 min, followed by 30 cycles of 94°C for 20 s, 60°C for 10 s, and 65°C for 1 min. The PCR products were purified using Agencourt Ampure (Beckman Coulter) according to manufacturer's instructions. The DNA concentration for the 141-bp and 179-bp amplicons was measured twice at 1∶1 and 1∶10 dilutions with a Nanovue (GE Healthcare). Ten-fold serial dilutions, from 1×106 to 10 copies/µl, of the purified fragments were used to make the standards. These were run with the qPCR conditions described above. For each standard, each 10-fold dilution was run in triplicate and the qPCR was repeated on a separate day. All the standards met the following criteria: (1) there was a linear regression relationship between DNA quantity and cycle threshold (fluorescence above background), R 2>0.95, and (2) the reaction was efficient (i.e., a doubling of product per cycle in the exponential phase), between 90% and 110%. Ancient qPCRs were run in triplicate with extraction and PCR blanks, and PCR standards (positive control) run in duplicate. Amplifications were performed on Rotor-Gene 6000 and analyses on Rotor-Gene 6000 Series Software 1.7 (Corbett). The difference in mtDNA quantity between fragment lengths (141 and 179 bp) was assessed using a nonparametric version of a Student's t test, a Wilcoxon signed-ranks test. This test was selected because the data were not appropriate for a parametric test, displaying a mixture of normal (179 bp, p = 0.425) and non-normal (141 bp, p = 0.012) distributions, as determined from a Shapiro-Wilk W test, which is appropriate for testing the normality of groups with small sample sizes. Authentication Criteria In line with previous publications on aDNA and especially with criteria for working with human aDNA, it can be stated that a 100% authentication of ancient samples is virtually impossible [22],[57],[60]. However, we took all possible precautions to prevent modern contaminations, and we regard the results as authentically derived from endogenous DNA based on the following chain of evidence. (1) All samples were collected under DNA-free conditions after excavation. Samples were not washed, treated, or examined before taking DNA samples. (2) All preparation and analytical steps prior to DNA amplification were conducted in a clean room area solely dedicated to aDNA work located in a physically separated building without any modern DNA work (pre-PCR area). Amplification, cloning, and sequencing were carried out in the post-PCR lab. (3) All steps were monitored by non-template controls and by using bovid samples in parallel. (4) All individuals were sampled twice from anatomically independent regions and treated independently. At least eight independent PCR reactions were carried out (four overlapping fragments × two extractions) per individual. In case of successful amplification of all eight fragments, these were cloned and an average of eight clones per amplicons was sequenced to detect heterogeneous sequences due to DNA degradation or contamination. All replicable polymorphic sites were consistent with existing mtDNA haplotypes, ruling out post mortem DNA damage as a potential source for erroneous sequences. (5) The new multiplexes not only clearly confirm hg assignment but also provide an ideal monitoring system for ancient human DNA samples, as they directly target SNPs defining all potential contaminating lineages. (6) qPCR was carried out on a selection of samples to ensure appropriate levels of DNA quantity and to assess DNA quality. (7) Samples were collected and processed by W. H. exclusively (mtDNA hg H1, np 15997–16409: 16189C 16311C, and Y chromosome hg E1b1b1a-M78) after excavation; no other staff were involved in any of the pre-PCR steps. Eventually, all listed criteria indicating authenticity or at least the plausibility of having retrieved endogenous DNA were evaluated, together with the sample's post-excavation history [60]. Populations under Study Four partly overlapping Neolithic datasets were analyzed: the 22 Derenburg individuals (DEB22); 20 individuals from other LBK populations previously published (LBK20; Table S5 and [19]); the combined LBK dataset (LBK42); and the combined LBK dataset excluding eight individuals of possible kinship (LBK34, see below) to avoid overestimation of haplotype frequencies. These four Neolithic sets were analyzed against extant population data from the MURKA mitochondrial DNA database and integrated software, currently containing 97,523 HVS-I records from published sources, and maintained by coauthors V. Z., E. B., and O. B. of the Russian Academy of Medical Sciences. Analyses were restricted to 390 populations from Europe and the Near East (35,757 mtDNAs). For detailed analysis of shared haplotypes, we included only sequences spanning from np 16069 to np 16365 (34,258 samples, haplotype dataset). aDNA sequences were trimmed to the same length. For frequency-based analyses (PCA, MDS, and genetic distance maps), we omitted mtDNAs whose hg affiliations were ambiguous (absence of information on coding region SNPs), resulting in our final hg frequency dataset of 23,394 individuals from 228 population studies, which subsequently were pooled into 55 populations based on ethnicity, language, and/or geographical criteria as described in the original publications (see Table S6). Addressing Potential Kinship within the Derenburg Graveyard The mtDNA and Y chromosome hg results were overlaid onto the map of the graveyard to elucidate the spatial relationships within the graveyard (Figure S3). Four haplotypes were shared by two individuals each, and two haplotypes by three individuals each, while the remaining eight individuals (36.4%) showed unique haplotypes within the Derenburg graveyard. A number of shared haplotypes is not surprising in a medium sized, closed LBK graveyard where the influence of genetic drift and a certain level of biological kinship are likely. However, little positional structuring according to maternal lineages was observed. A clustering of mtDNA haplotypes H-rCRS (deb9 and deb21) and HV (deb4, deb20, and deb5) in the northwest corner of the cemetery is notable, whereas other shared haplotype “twins” or “trios” with a potential maternal relationship are spread across larger distances. However, it must be stated that there are many other factors influencing the layout of interments in a graveyard that cannot be unraveled by aDNA analyses. LBK burials commonly show a great variety of mortuary patterns or rites at the same site (e.g., burials within the settlement and burials in pits/middens), and it is therefore not clear whether individuals in the cemetery represent the norm or the exception, and how much of the initial genetic variation of the population is missing [44]. In any case, to avoid overestimation of haplotype frequencies in the LBK dataset, the eight duplicate haplotypes were excluded, and a reduced dataset (LBK34) was used in population genetic analyses alongside the complete set to account for a potential kinship effect. Haplotype Diversity and Tajima's D Haplotype diversity (h) and Tajima's D were calculated using DnaSP version 5 [61]. Shared Haplotype Analysis In order to calculate the percentage of shared haplotypes between the LBK sample and modern-day populations, we chose modern populations of equal or larger sample sizes, resulting in 36 out of 55 pooled populations with sample size n = 500 or above. Pooling was based on geographic proximity and linguistic similarity. For population studies with n>500, 500 samples were selected randomly. After pooling and random selection the dataset comprised 18,039 samples. A pivot table was created (4,140 haplotypes in rows and 36 populations in columns), and Neolithic LBK data were included. Similarity between LBK and other populations was described quantitatively in two ways: (1) indicating presence or absence (1/0), i.e., whether or not the particular Neolithic haplotype was found in a given modern population, and (2) indicating the number of hits, i.e., how many times the particular haplotype was found in a given population. The 25 different LBK sequence haplotypes were sorted into clusters of noninformative (11), informative (10), and unique (4) haplotypes (Table S4). We then calculated the relative frequency of each of the shared informative vs. noninformative LBK sequence haplotypes in each of the 36 modern-day populations (Table S4). A two-tailed z test (Excel version 12.1, Microsoft Office) was applied to determine which population pool showed a significantly higher or lower percentage of shared informative haplotypes (Table S4). Nonparametric bootstrapping of 100 replicates for each hg per population was used to generate the confidence intervals for the percentage of hgs that are shared between all matches, informative matches, and noninformative matches. Bootstrapping was performed in Excel.version 12.1. Principal Component and Multidimensional Scaling Analyses Classical and categorical PCAs and MDS were performed using the hg frequencies dataset. To avoid overpopulating graphs with 228 populations, populations were pooled into 55 groups defined by ethnicity, language, and/or geography as described in the original publications (see Table S6). To minimize statistical noise caused by very rare hgs, we considered only the following 19 hgs with average frequency above 1% in Europe and Near East: preHV, H, HV, J, T, I, N1a, K, V, W, X, U2, U3, U4, U5a, U5b, the group of African hgs (L and M1), the group of East Eurasian hgs (A, B, C, D, F, G, and Z), and the group of all other (rare) hgs. PCAs and categorical PCAs (used for the biplot graph in Figure 1, with default settings to correspond to a classical PCA) were performed and visualized using the software package SPSS Statistics 17.0. Nei's genetic distances [62] were calculated using the software program DJ, written by Yuri Seryogin (freely available at http://www.genofond.ru). The resulting distance matrix was visualized via MDS in SPSS Statistics 17.0. Mapping Genetic Distances The genetic distances from two Neolithic datasets (DEB22 and LBK42) to populations in the hg frequencies dataset (pooled into 120 populations with the average sample size n = 196 to gain a balanced geographical coverage) were calculated using the software DJ. Distances were plotted on a geographic map of Europe using the software GeneGeo written by S. K. This software is the renewed GGMAG package previously used for gene geographical studies ([63] and references therein). Bayesian Serial Simcoal Analysis We calculated population-specific pairwise genetic distances (F ST) in Arlequin version 3.5 [64], using 377-bp HVS-I sequences (np 16069–16365) assigned to one of four populations (Table S6): modern Central Europeans from the LBK core area (n = 1,030), modern Near Easterners (n = 737), LBK samples (n = 42), and hunter–gatherers (n = 20). F ST values were estimated using the Kimura two-parameter model [65] using a gamma distribution with shape parameter of 0.205 [66]. To test whether drift can account for the high F ST values between ancient and contemporary populations from Central Europe and the Near East we modeled three alternative population histories (Figure S2) using simulated coalescent analyses in the program BayeSSC [67],[68]. Under the null hypothesis (H0) we considered one large continuous Eurasian population with an effective population size ranging from 100,000 to 30 million and an exponential growth starting from a small Palaeolithic deme of 5,000 females, 300 (H0a) or 1,500 (H0b) generations ago. Hypothesis 1 (H1) assumes two exponentially growing populations, a Central European deme (100,000 to 12 million) and a Near Eastern deme (100,000 to 12 million), which coalesce 1,500 generations ago (37,500 y ago, assuming 25 y per generation) in an Early Upper Palaeolithic deme of 5,000 females and constant size. Here, ancient samples from hunter–gatherers and Neolithic farmers were included in the Central European deme; therefore, this model can be considered a test for genetic continuity of Central European lineages under a scenario of cultural diffusion/transmission. Alternatively, we modeled a contrasting (“demic diffusion”) scenario (H2), similar to H1 in structure but allowing for migration from the Near Eastern deme 290 generations ago. We tested a contribution of 25%, 50%, and 75% migrants from the Near Eastern to the Central European deme. Each model was simulated initially using BayeSSC for 100,000 genealogies and a fixed mutation rate of 7.5×10−6 per site per generation [66]. A uniform distribution was used for priors to estimate effective population sizes at time 0 for the Central European and Near Eastern demes (Table 4). To compare the simulated and observed data, five pairwise F ST values were chosen that reflect population differentiation between each of the two ancient samples and modern populations (Table 3). The simulated and observed F ST values were compared within an ABC framework [69], in which the top 1% of simulations were retained. Posterior distributions for each of the parameters with a prior were assessed. ABC was performed in R version 2.11.0 using scripts freely available at http://www.stanford.edu/group/hadlylab/ssc/index.html. To compare the goodness of fit of each model using AIC [70] given the observed data, priors were removed from the model and replaced with absolute parameter values that gave the maximum likelihood. The model was rerun in BayeSSC for 1,000 genealogies. The AIC for each model was calculated in R, and Akaike weights ω to compare the relative likelihood of each model where calculated in Excel version 12.1 [71],[72]. Supporting Information Dataset S1 Sequence alignments of the Derenburg individuals. (17.75 MB PDF) Click here for additional data file. Figure S1 Multidimensional scaling plot of genetic distances based on haplogroup frequencies (alienation = 0, 1117760; stress = 0, 1053030). Population abbreviations are consistent with Figure 1, and further population details and references are listed in Table S6. (1.05 MB TIF) Click here for additional data file. Figure S2 Demographic models and population pairwise F ST values used in BayeSSC analyses. CE1, Central European deme 1; exp, exponential; HG, hunter–gatherers; M, migrants; Ne, effective population size; NE0, Near Eastern deme 0; r, growth rate; UP, Upper Paleolithic. (3.00 MB TIF) Click here for additional data file. Figure S3 Map of the Neolithic graveyard Derenburg Meerenstieg II. (1.29 MB TIF) Click here for additional data file. Table S1 Results of mtDNA coding region SNP typing using the GenoCoRe22 assay. SNPs are detected in forward orientation (L-strand) unless stated otherwise (underlined), and SNP results are reported as typed in the SBE assay. Italicized samples were discarded from further analyses. Samples were typed twice from two independent extracts except for individuals deb1 and deb2. Empty cells indicate either allelic dropout or a relative fluorescence unit value below the threshold of 50. SNP 3594_L3'4 consistently yielded relative fluorescence unit values below 50, and was not reported. Subsequent primer mixes were adjusted for the suboptimal performance of SNP3594 (Table S7). (0.26 MB DOC) Click here for additional data file. Table S2 Results of Y chromosome SNP typing using the GenoY25 assay. SNPs are detected in forward orientation unless stated otherwise (underlined), and SNP results are reported as typed in the SBE assay. (0.21 MB DOC) Click here for additional data file. Table S3 Quantitative real-time PCR of Neolithic Samples from Derenburg. (0.02 MB XLS) Click here for additional data file. Table S4 Shared haplotype analyses. (0.08 MB XLS) Click here for additional data file. Table S5 Ancient samples from other LBK sites used for population genetics analyses [19]. (0.07 MB PDF) Click here for additional data file. Table S6 Details of Neolithic and modern-day populations used for comparison. (0.14 MB XLS) Click here for additional data file. Table S7 GenoCoRe22 and GenoY25 multiplex assay and additional Y chromosome PCR primer information. (0.24 MB XLS) Click here for additional data file.
              Bookmark
              • Record: found
              • Abstract: found
              • Article: not found

              Genetic evidence for Near-Eastern origins of European cattle.

              The limited ranges of the wild progenitors of many of the primary European domestic species point to their origins further east in Anatolia or the fertile crescent. The wild ox (Bos primigenius), however, ranged widely and it is unknown whether it was domesticated within Europe as one feature of a local contribution to the farming economy. Here we examine mitochondrial DNA control-region sequence variation from 392 extant animals sampled from Europe, Africa and the Near East, and compare this with data from four extinct British wild oxen. The ancient sequences cluster tightly in a phylogenetic analysis and are clearly distinct from modern cattle. Network analysis of modern Bos taurus identifies four star-like clusters of haplotypes, with intra-cluster diversities that approximate to that expected from the time depth of domestic history. Notably, one of these clusters predominates in Europe and is one of three encountered at substantial frequency in the Near East. In contrast, African diversity is almost exclusively composed of a separate haplogroup, which is encountered only rarely elsewhere. These data provide strong support for a derived Near-Eastern origin for European cattle.
                Bookmark

                Author and article information

                Journal
                10.1016/j.tube.2015.02.021
                https://www.elsevier.com/tdm/userlicense/1.0/

                Comments

                Comment on this article