54
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: not found

      Demographic History of European Populations of Arabidopsis thaliana

      research-article

      Read this article at

      ScienceOpenPublisherPMC
      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          The model plant species Arabidopsis thaliana is successful at colonizing land that has recently undergone human-mediated disturbance. To investigate the prehistoric spread of A. thaliana, we applied approximate Bayesian computation and explicit spatial modeling to 76 European accessions sequenced at 876 nuclear loci. We find evidence that a major migration wave occurred from east to west, affecting most of the sampled individuals. The longitudinal gradient appears to result from the plant having spread in Europe from the east ∼10,000 years ago, with a rate of westward spread of ∼0.9 km/year. This wave-of-advance model is consistent with a natural colonization from an eastern glacial refugium that overwhelmed ancient western lineages. However, the speed and time frame of the model also suggest that the migration of A. thaliana into Europe may have accompanied the spread of agriculture during the Neolithic transition.

          Author Summary

          The demographic forces that have shaped the pattern of genetic variability in the plant species Arabidopsis thaliana provide an important backdrop for the use of this model organism in understanding the genetic determinants of plant natural variation. We investigated the demographic history of A. thaliana using novel population-genetic tools applied to a combination of molecular and geographic data. We infer that A. thaliana entered Europe from the east and spread westward at a rate of ∼0.9 kilometers per year, and that its population size began increasing around 10,000 years ago. The “wave-of-advance” model suggested by these results is potentially consistent with the pattern expected if the species colonized Europe as the ice retreated at the end of the most recent glaciation. Alternatively, it is also compatible with the possibility that A. thaliana—a weedy species—may have spread into Europe with the diffusion of agriculture, providing an example of the phenomenon of “ecological imperialism” described by A. Crosby. In this framework, just as weeds from Europe invaded temperate regions worldwide during European human colonization, weeds originating from the source region of farming invaded Europe as a result of the disturbance caused by the spread of agriculture.

          Related collections

          Most cited references57

          • Record: found
          • Abstract: found
          • Article: not found

          The Pattern of Polymorphism in Arabidopsis thaliana

          Introduction The field of population genetics has always been heavily influenced by mathematical models. Ever since molecular polymorphism data started to become available, in the form of allozymes [1] or DNA sequences [2], population geneticists have searched for footprints of selection by comparing the patterns of polymorphism in particular genes with the pattern expected under standard neutral models [3,4]. Considerable intellectual effort has gone into estimating model parameters such as the mutation rate θ, the recombination rate ρ, and the effective population size, Ne [3,5]. However, because of the limited availability of data, it has been difficult to determine whether the underlying models are appropriate. For example, demographic factors such as population structure and growth can cause the genome-wide pattern of variation to deviate from standard neutral models in ways that mimic selection [4,6]. Thus, without knowing whether a standard neutral model describes the pattern of variation in most of the genome, it is difficult to conclude that a particular gene has been under selection. With the advent of high-throughput genotyping and sequencing, sufficient data for the critical appraisal of standard models are starting to become available, especially in humans [7–9]. Here we report our findings from a systematic survey of genomic DNA sequence polymorphism in Arabidopsis thaliana, one of the first in any organism. Our goal was to investigate the pattern of polymorphism in a large sample of individuals, using sufficiently densely spaced loci to obtain insight into the genome-wide haplotype structure of the species. The scale of our study allows us to describe the pattern of polymorphism with unprecedented accuracy. We begin by describing how variation is distributed, with respect to space (i.e., population structure) as well as with respect to haplotypes (i.e., linkage disequilibrium [LD]). Our set of 96 individuals contained hierarchical population samples in addition to a worldwide collection of stock center accessions (Tables S1 and S2): because of this and the large number of polymorphisms, we are able to answer a number of questions about population structure that previous studies have not been able to address. In the second part of the paper, we compare the pattern of variation to predictions made by standard population genetics models. The number of loci sequenced is sufficient to investigate the distribution of important summary statistics across the genome rather than simply looking at averages (as is usually done). Results/Discussion Sequencing A total of 876 reliable alignments, representing 0.48 Mbp of the genome (or a total over all individuals of ~44 Mbp) was generated. The average sequence length is 583 bp; the average sample size across alignments is 89. Based on the A. thaliana genome annotation, the composition of our data, which includes more than 17,000 single nucleotide polymorphisms (SNPs) and insertion/deletion polymorphisms, is 15% intergenic, 55% exon, 22% intron, 4% UTR, and 5% pseudogene (see Materials and Methods). The majority of fragments, 67%, contain both coding and noncoding sequence. Population Structure Overall levels of polymorphism Our estimates of the level of polymorphism are broadly comparable to what has previously been found in A. thaliana and other species, both in terms of overall levels of polymorphism, and in the degree of constraint on different kinds of sites (Figure 1; cf. [3]). The highly selfing A. thaliana does not have unusually low levels of polymorphism: the observed values are somewhat lower than for Drosophila melanogaster, and are considerably higher than for humans. A standard way of summarizing the geographical distribution of this variation is through the statistic FST, which, loosely speaking, measures the fraction of the observed genetic variation that is due to population structure [10]. Our sample contains 40 individuals that were hierarchically sampled in pairs from four populations in each of five regions (Table S1). A hierarchical analysis of variation in these individuals reveals that 33% of the global variation is segregating among individuals within populations, 35% is segregating among local populations within regions, and 26% is segregating among regions. Only 6% of the variation in the global sample is not captured by these 40 individuals. Even though only two individuals were sampled per population, and even though our estimates of within-population variance are upwardly biased (individuals were prescreened to avoid sequencing identical individuals; see Materials and Methods), our data clearly show that individual populations harbor much of the variation present species-wide. At the same time, there is strong population structure. Global geographic structure Studies of variation in A. thaliana have typically not found any correlation between the genotype and geographic origin of accessions. This has been attributed to a recent expansion of the species, perhaps in combination with human disturbance. However, early studies had little power to detect population structure, and a more recent, larger survey revealed weak isolation by distance [11]. Our study has several orders of magnitude more markers than any previous study known to us, and we find clear global population structure. We used a model-based clustering algorithm, implemented in the computer program Structure, to cluster our accessions on the basis of genotype [12]. Loosely speaking, the algorithm attempts to identify a predetermined number of clusters, K, that have distinctive allele frequencies, and assigns portions of individual genomes to these clusters. A genome can have membership in several clusters, and the algorithm reports the probability distribution of the assignment of each section of the genome. We analyzed the data by successively increasing K from two to eight (Figure 2). For K = 2, we see an East–West gradient, potentially attributable to post-glaciation colonization routes [11]. When we increaseK to three, all accessions from northern Sweden and Finland are assigned to a single cluster together with (to varying degrees) accessions from Eastern Europe, Russia, and Central Asia. While a relationship between northern Sweden and Tajikistan may seem far-fetched, several species are known to have colonized the Scandinavian Peninsula from both north (from Russia via Finland) and south (from Europe via Denmark) after the last glaciation [13]. As we increase K from three to eight, each new cluster splits a previously existing one along plausible geographical boundaries (and identifies some accessions as mixed). Thus K = 4 separates central European (Czech, Austrian, and Croatian) accessions from the main European cluster, K = 5 separates a subset of the United States accessions from the rest of Western Europe, K = 6 separates the Central Asian and Russian accessions from northern Sweden and Finland, K = 7 separates many German and southern Swedish accessions from the rest of Europe, and K = 8 identifies a Catalan cluster. Clusters identified for K > 8 did not contain the majority of any single individual genome. When these results are superimposed on a map (Figure 3), the pattern of isolation by distance becomes obvious. Individuals are, by and large, more similar to individuals that grow nearby than to individuals from far away. Although A. thaliana commonly occurs as a weed and human commensal, this has not been sufficient to erase population structure. Population structure is also evident in the distribution of pairwise differences between individuals. In the absence of population structure, all individuals should be equally closely related on average. As shown in Figure 4, this is clearly not the case. Not only is there generally a wider range of variation than would be expected in the absence of population structure, but there are also clear outliers. Some individuals are extremely closely related: these are typically from the same local population (and will be further discussed below). Perhaps more surprising is that two stock center accessions, Cvi-0 (Cape Verde Islands) (cf. [14]) and Mr-0 (Italy), are very different from all others (and each other). The distribution of pairwise differences can be conveniently summarized using hierarchical clustering. The population structure revealed by the resulting tree (Figure S1) generally agrees with the output of Structure. Variation within and between populations Because it is highly selfing, A. thaliana has often been considered a collection of asexual lineages, or “ecotypes.” This view is completely false. Indeed, even the first sequencing survey of variation showed clear evidence of recombination between accessions [15], and a subsequent study has shown that recombination has generally been sufficient to erode genome-wide LD on a very fine scale [16]. Consequently, there is no “phylogeny” of ecotypes. However, this still leaves open the possibility that local population structure is tree-like, with individuals within the same population being much more closely related to each other than to individuals from other populations. Indeed, since A. thaliana is a selfer, it is perfectly possible for a local population to consist of a single inbred sibship. We find that this is typically not the case; most sampled populations were polymorphic (even though this part of our study had relatively little power; see Materials and Methods), and when we consider the genome-wide data, the pattern that emerges is far from tree-like. As shown in Figure 2C and 2H, only a small fraction of all sequenced fragments have patterns of polymorphism compatible with monophyly in the sense that, with respect to that fragment, the members of a particular cluster are all more closely related to each other than to members of other clusters (the yellow “US” cluster is an exception to which we return below). Instead, just as is the case for human populations [17], most polymorphisms are shared between clusters, and levels of polymorphism are in general comparable within all clusters (see Figure 2E; for clarity, only values for K = 3 are shown). The same is true with respect to the local populations, although there is great variation between regions. An interesting way to describe the relationship between individuals is to try to identify chromosomal regions shared identical by descent by looking for long identical haplotypes [18–22]. The resulting patterns clearly reveal the difference in structure between geographic regions (Figure 5). In northern Sweden, pairs of accessions sampled from the same population are invariably more closely related to each other than to accessions from other populations; however, even here populations are far from monophyletic in the sense used above (i.e., for a particular locus, the closest relative may well come from a different population). In most regions (exemplified in Figure 5 by central Europe), haplotype sharing is moderate, and is not much greater within than between populations. The US sample is again different: here, pairs of accessions often appear to share entire chromosomes, and are equally likely to do so between populations. Regional variation in the level of population structure is also evident in the distribution of pairwise differences (see Figures 4 and S1). Individuals that are extremely closely related (less than ten differences) are almost always pairs from the same local population (two pairs from northern Sweden, one pair from Finland, and one from Germany). The one exception is a trio of nearly identical individuals from different US populations. Figure 2D and 2I illustrate the variation in FST across the genome for K = 3 and K = 8. This pattern is of interest in that regions with extremely high or low values may be seen as candidates for harboring selectively important loci [23,24]. Structure summarized The picture that emerges is that of a single, large, globally distributed population with historical gene flow sufficient to ensure that variation is shared worldwide, yet limited enough to cause considerable population structure. Genetic exchange is not only geographic; there has been enough outcrossing to ensure that LD decays within 25–50 kb on average (Figure 6), which is comparable to what has been observed in humans. All of this may seem surprising given the highly selfing nature of A. thaliana, but it is in fact completely compatible with theoretical predictions [16,25]. The only exception to this pattern comes from the US. Our sample from the US Midwest is clearly a heterogeneous collection, characterized by genome-wide LD and haplotype sharing. Especially notable is extensive haplotype sharing with accessions from other regions, in particular with the United Kingdom. All this strongly suggests that A. thaliana is a recent human introduction to the New World, and that the introduction severely reduced haplotype variation through bottleneck effects, causing genome-wide LD [16]. Since we have population samples from only the Midwest, we cannot rule out the possibility that the pattern is different in other parts of the US; however, we note that our one non-Midwestern US accession, Yo-0 (from Yosemite, California), is almost identical to some of our Midwestern accessions, and that a recent survey of variation in 53 US populations found no evidence of differentiation across the continent [26]. It should be emphasized that we view both Structure and hierarchical clustering as tools for exploring the data. The results should not be taken literally. For example, we do not believe that there are K = 8 random-mating populations in A. thaliana, as might be suggested by Figures 2 and 3, nor do we believe that populations are related in a tree-like manner, as might be suggested by Figure S1. Global Patterns Allele frequency distribution Note that θ̂S is consistently higher than θ̂P (see Figure 1). This is typically caused by an unusually high ratio of rare to common alleles, compared to what is expected under standard population genetics models [27]. A closer look at the distribution of allele frequencies reveals that this is indeed the case (Figure 7A). The observed distribution is skewed toward rare alleles compared to standard neutral expectations. A possible explanation for this is recent population growth [28]; however, the skew is much greater for nonsynonymous than for synonymous polymorphisms, suggesting that selective factors must also be involved. The effect of this genome-wide deviation from standard neutral models on “tests of selection” can be dramatic. These tests typically assume that the standard neutral model describes most of the genome, and interpret deviations at particular loci as signs of selection [3]. Our results show clearly that this procedure is not appropriate for A. thaliana (see also [29]). It is, of course, not a new finding that demographic history can invalidate tests of selection, but our results provide a striking illustration of the potential seriousness of the problem. For example, the mean value of one popular statistic, Tajima's D [27], is −0.8 rather than the (approximately) zero expected under simple neutral models, and the variance is also larger than predicted (Figure 7B). Positive values of Tajima's D are typically attributed to balancing selection: 2% of our fragments are significantly positive at the 1% level. Negative values are typically attributed to directional selection: 15% of our fragments are significantly negative at the 1% level. Although some of these deviations may, of course, actually be due to selection, our data suggest that tests based on standard cutoff values are anticonservative in both tails of the distribution. Consistent with this interpretation, a much higher fraction of studied genes have been reported to be under selection in A. thaliana than in other species [30]. Variation in the level of polymorphism While it is straightforward to fit a neutral model with growth to the observed distribution of Tajima's D (Figure 7B), the value of this exercise is doubtful. First, it is clear from Figure 7A that selection must be part of the reason for the skewed allele frequency distribution. Second, a model with growth would, in fact, fit other aspects of the data less well. In particular, population growth tends to reduce the variability in coalescence times across the genome compared to models with constant population size, resulting in less variation in the level of polymorphism between loci. We see the opposite: the variance between loci is considerably greater than expected under a standard neutral model with constant population size (Figure 7C). In addition, the distribution is heavily skewed and displays a long tail of extremely high values. There are several reasons to expect a poor fit to a simple neutral model. One is variation across the genome in the “neutral” mutation rate, θ, either due to variation in the level of selective constraint or due to variation in the actual, underlying mutation rate. Since the excess variability is observed equally for coding and noncoding DNA, we would have to invoke the latter. The extent to which variation in the mutation rate contributes to the pattern in Figure 7C can be estimated as soon as divergence data from a closely related outgroup species become available. Another factor likely to contribute to the variation in the level of polymorphism is population structure such as that described in the first part of the paper. It is well known that population structure can inflate the variance of coalescence times as well as induce a tail of very large values that would result in patterns of variability such as the one observed [6]. However, strong population structure is generally expected to push the distribution of Tajima's D toward positive values, which is the opposite of what we observe. It is possible that some model that involves growth (perhaps in conjunction with bottlenecks), structure, and finite sampling [7] could explain the pattern observed in A. thaliana, but we have not been able to find such a model (see also [29]). Genomic patterns of polymorphism It turns out that the pattern of polymorphism is affected by not only demographic forces, but also factors intrinsic to the genome. Figure 7D shows that polymorphism in noncoding regions is negatively correlated with local gene density. This mirrors the positive correlation between polymorphism and local recombination rates that was first noted in Drosophila [31] and has since been observed in a wide range of organisms [32]. A possible explanation is that recombination itself is mutagenic: this appears to explain the correlation observed in humans, but is not sufficient to explain the phenomenon in general [32,33]. Instead, it has been proposed that variation is reduced in low-recombination regions because of a “hitchhiking effect” due to selection on linked sites [34,35], in the form of either positive selection (“selective sweeps”) [36] or purifying selection (“background selection”) [37]. Such hitchhiking effects would be stronger in low-recombination regions because sites in these regions are affected by selection on larger pieces of the chromosome (i.e., more genes). Recombination is thus used as a proxy for gene density: the real prediction of these models is that polymorphism should decrease with gene density. This is precisely what we observe. The level of polymorphism is insignificantly positively correlated with recombination (suggesting that although recombination may well be mutagenic, this does not explain the phenomenon), but is strongly negatively correlated with gene density. Two factors suggest that background selection rather than selective sweeps is responsible for the correlation. First, unlike background selection, selective sweeps are expected to skew Tajima's D toward negative values. However, we find no correlation between Tajima's D and gene density. Second, it is clear from Figure 7A that a significant load of deleterious mutations (as is required by background selection) exists in A. thaliana. Figure 7E reveals that polymorphism is also positively correlated with segmental duplication, similar to what has been observed in humans [38–40]. In humans, the phenomenon appears to be largely due to misclassification of paralogous sequence variants as SNPs. Distinguishing between paralogous sequence variants and true SNPs is difficult for human data, where putative SNPs are typically detected in small samples of highly heterozygous individuals. In contrast, our data consist of high-quality sequences from a large sample of almost completely homozygous individuals, and we are therefore confident that nearly all of our polymorphisms are genuine (see Materials and Methods); fragments that have a close match elsewhere in the genome thus appear to be more variable than fragments that do not. We hypothesize that this is caused by a low level of intergenic gene conversion that serves to “shuffle” variation between loci. Such gene conversion has long been known to occur in large multigene families (“concerted evolution”; [41]): our results suggest that it may be a general phenomenon. Recombination and gene conversion We noted above (see Figure 6) that LD decays within 25–50 kb, somewhat faster than has previously been suggested [16]. At least 25% of our sequenced fragments show evidence of recombination (using the four-gamete test; [42]). Estimates based on coalescent models suggest an effective population recombination rate (e.g., [6]) of approximately ρ = 2 × 10−4 per basepair (V. P., B. P., P. Marjoram, J. W., and M. N., unpublished data). Given our estimates of the mutation rate θ (see Figure 1), this implies a ratio θ/ρ of about 20. The short-range pattern of LD in several species is incompatible with the long-range pattern; there is too little of the former relative to the latter for a simple recombination model that includes only crossing over to explain the data [43–48]. Possible explanations include gene conversion and multiple mutations (i.e., each SNP not being due to a unique mutation event), both of which will erode short-range LD [46]. There is clear evidence for both phenomena in our data. We observed a total of 315 tri-allelic SNPs. Since less than 50% of all multiple-hit mutations will result in more than two distinct alleles, this suggests that a total of more than 600 of our SNPs are, in fact, the product of multiple mutations. We also observed three fragments that show clear evidence for gene conversion in that a single gene conversion event (i.e., a double cross-over within 500 bp, as would result from the resolution of a single Holiday junction) suffices to explain a complicated pattern of polymorphisms based on multiple SNPs in the fragment. Coalescent-based analyses based on the fine-scale pattern of polymorphism suggest that gene conversion is about five times more common than crossing over, in agreement with previous population genetic analyses [48], as well as with direct estimates based on tetrad analysis [49]. Concluding Remarks We have shown that the pattern of polymorphism in A. thaliana, a selfing human commensal, generally agrees with what would be expected for a widely distributed sexually reproducing species. Although there is significant population structure, polymorphism is shared worldwide. As predicted by population genetics theory [25], the only clear indications of selfing in the pattern of polymorphism are that individuals are typically homozygous, and that LD is unusually extensive. The scale of our study allows us to consider the genomic distribution of statistics commonly used to summarize polymorphism data. We find that these distributions generally deviate significantly from what is assumed by standard population genetics models. This highlights the danger of using highly parameterized models based on untested assumptions for inference in population genetics. Commonly used “tests of selection” are simply not valid in A. thaliana (cf. [29,30]). Large-scale analyses in other organisms have similarly found genome-wide deviations from standard models (e.g., [43,50,51]). As data continue to accumulate, the focus of population geneticists will surely have to shift from rejecting null models that do not fit particular loci to finding models that actually do explain the bulk of the data. Genomic polymorphism data are required to develop more robust inference methods, and will enable us to study phenomena that are intrinsically genomic (e.g., the correlations in Figure 7D and 7E). More importantly, however, these data will help identify the functional polymorphisms that underlie phenotypic variation. The pattern of polymorphism in A. thaliana, characterized by humanlike levels of LD but much higher SNP density, coupled with the availability of naturally occurring inbred lines, makes the species ideal for LD mapping. Although the strong population structure is likely to cause a high rate of spurious genotype–phenotype associations, these problems can easily be overcome through direct experimental verification using crosses or transgenics. This significantly strengthens the position of A. thaliana as a model for evolutionary functional genomics. Materials and Methods Sampling The sample of 96 individuals included pairs of individuals from 25 local “populations” (typically sampled within a few hundred meters of each other, often much closer) as well as a worldwide survey of commonly used stock center accessions (Tables S1 and S2). Where possible, four populations were sampled from each of several regions. The sample was generated by screening a larger set of accessions with a small number of markers to avoid inbred siblings or extensively heterozygous individuals (E. B., E. Stahl, C. T., M. N., M. K., and J. B., unpublished data). Accessions were genotyped using 11 unlinked markers (five microsatellites, two indel R-genes, and four housekeeping genes with previously identified polymorphisms). To ensure that individuals sampled from local populations were not part of inbred sibships, four (three in one case) individuals from each of 37 populations were tested. Polymorphism was found in 25 of these populations, and a pair of nonidentical individuals was selected at random from each (Table S1). Some accessions not from the same population were also found to be identical with respect to these markers (Col-0 and Lp2-2; Ts-1 and Shahdara), but these were included nonetheless. Five accessions were found to be heterozygous and were eliminated. Four of these were from the population samples, and one, Ms-0, was from the stock center. Further testing of two additional Ms-0 lines revealed one more heterozygote and one homozygote, which was included. In spite of these precautions, one sequenced stock center accession, Van-0, turned out to be extensively heterozygous and was eliminated from the analyses in this paper (bringing the sample size to 95). Data generation We used direct, PCR-based sequencing of genomic DNA, with primers designed from the A. thaliana reference sequence to cover the genome relatively uniformly. To achieve uniform density of our fragments, the reference genome (releases January 7, 2002, and April 17, 2003) was first divided into equally spaced regions. The last 10 kb of each region then served as an input record to Primer3 (v. 0.6). The designed primer pairs returned from Primer3 for each region were then screened for uniqueness and quality. To screen for uniqueness, all primer pairs were BLASTed (BLAST v. 2.2.3) against both the reference genome as well as BAC datasets downloaded from the Arabidopsis Information Resource (http://www.arabidopsis.org/). Any primer pair that produced a hit in the same region (≤2,300 bps) was removed. Self-amplifying primers were also removed based on this same criterion. Additionally, primers with more than five BLAST hits against the reference were also discarded. To improve the quality of each fragment, any primer pair that amplified a target sequence that contained a homonucleotide run of nine bases or more was removed. All sequencing was done using ABI 3700 automated sequencers (Applied Biosystems, Foster City, California, United States). All fragments were sequenced in both directions. Chromatograms were initially base-called with Phred (v. 0.020425.c) and trimmed based on quality value. The start and end of each read was trimmed until the average quality value was 25 in a window of ten bases, and internal bases were converted to missing data when their quality value was below ten. Accessions missing one read of data were trimmed more severely (different setting were used). A combination of Phrap (v. 0.020425.c) and ClustalW (v. 1.82) was used for producing alignments using a modified weight matrix that allowed us to incorporate quality values into the ClustalW algorithm. Alignments were then visually inspected and adjusted as necessary using Consed (v. 13.0). Polyphred (v. 4.20) was used to flag potential heterozygotes, which were confirmed by visual inspection of chromatograms. Additional trimming was performed as necessary for accessions with multiple false polymorphisms and low-quality sequence after a visual inspection of chromatograms. Whenever two reads from the same accession disagreed, the final call was made by visually inspecting chromatograms unless the difference in quality value made the final call obvious. Potential polymorphisms in each alignment were then verified by a second person. All alleles found only once or twice in the sample were verified by visually inspecting the chromatograms. When this inspection did not reveal a chromatogram peak clearly different from the other accessions, the base was changed to missing data. This would, if anything, produce a slight underestimate in alleles of frequency one and two. Higher-frequency polymorphisms with generally low-quality values (20 or lower) were also verified by checking the chromatograms. A total of 876 high-quality fragment alignments were obtained from 979 PCR primers and used for the analyses in this paper. Of the remaining PCR primers, some failed at the stage of PCR amplification and sequencing, while some produced sequencing output that could not be base-called with certainty when the sequence quality was particularly low or when there was evidence that the primer pairs amplified two or more different products in some of the accessions. To calculate genetic distances, we used a set of markers that have been genetically mapped to the Lister and Dean recombinant inbred lines and that also can be mapped to the AGI reference genome. Some markers were removed so that both physical position and genetic position were monotonically increasing functions. All data are publicly available through our Web site (http://walnut.usc.edu/2010, and also as Dataset S1. Population structure To infer population structure and assign accessions to populations, we used a model-based clustering algorithm implemented in Structure v. 2.0 [12]. Since A. thaliana is largely homozygous, we used a haploid setting. We used the “linkage model” with “correlated allele frequencies” in Structure, where genetic distances (calculated by fitting a third-order polynomial to the Lister and Dean recombinant inbred mapping data) were used to indicate locus proximity. The algorithm was run with a burn-in length of 50,000 MCMC iterations and then 20,000 iterations for estimating the parameters. This was repeated ten times for each K (ranging from one to 17). In these analyses, each fragment-haplotype was treated as a marker at a multiallelic locus, so that two accessions had a different type if they differed at any site in the fragment. The likelihood of the data increases with K from K = 1 until K = 7 (using the Wilcoxon two-sample test to compare the ten runs for each K; two-sided p = 0.001 for K = 7 versus K = 6). The likelihoods of K = 7 and K = 8 were similar (two-sided p = 0.97). For K > 7, the likelihoods of different runs were more variable than for K ≤ 7, with the added variability caused only by runs with lower likelihoods. Moreover, the additional clusters for K > 8 do not have a majority of the genome for any of the accessions. These observations taken together indicate that it is less meaningful to choose K > 8. In displaying the output from Structure, we computed an average of the ten runs for each K. Because there are K! distinct permutations of the clusters that all correspond to equivalent assignments of membership coefficients to accessions, and because independent runs may produce different permutations, to compute an average we first permuted the clusters to align the solutions. For R runs, there are (K!) R − 1 ways of aligning clusters across runs. To determine which of the clusters of each of the other runs corresponds to a specific cluster in a given run, the symmetric similarity coefficient (SSC) was used with the matrices of membership coefficients (based on the genome-wide average). For a given K, the SSC was calculated for all combinations of pairs of runs: where Q i and Q j are the membership matrices of runs i and j (i ≠ j), P is a permutation; the minimum is taken over all permutations, S is a probability matrix of K columns where all elements equal 1/K, and A F is the Frobenius matrix norm [52]. This is a slight adaptation of the asymmetric similarity coefficient used in previous work [17]. For K = 2, the runs were permuted to the arrangement that maximizes the sum of SSC across pairs of runs, and an average of the membership matrices across runs was then taken. For K > 2, it was not feasible to test all possible arrangements; therefore, the following greedy algorithm was used. (1) Fix a permutation, P 1, of one (randomly chosen) run, . (2) Randomly choose a second run, Q2, and fix the permutation, P 2, that maximizes . (3) Continue sequentially with each remaining run, Q x , where x = 3,…, R, and fix the permutation, Px , that maximizes for the current run, Q x . Because the choice of starting run can affect the result, we tested all ten possibilities for the starting run. For K = 2 to K = 8, there were thus 70 possible ways of starting the algorithm, and in only two of 70 possible cases was a different result obtained. These two solutions differed from the common solution by switching one pair of clusters in one run (2.5% of the clusters differed from the common solution), and switching one pair of clusters in two different runs (5%). We tested for monophyly as follows. For every variable site in a fragment, each cluster was checked for the presence of both alleles as well as for the presence of both alleles outside the cluster. If a variable site in a fragment had both alleles within the cluster as well as outside the cluster, then the whole fragment was deemed nonmonophyletic for that specific cluster. Clusters that failed to show nonmonophyly for a fragment were considered monophyletic for that fragment. Fragments with less than five variable sites and clusters with less than five accessions were always considered to be nonmonophyletic. FST for the inferred clusters was computed as: where θ̂P, total is the average number of pairwise differences per site for all pairs of accessions, and θ̂P, within is the average number of pairwise differences per site for all pairs within cluster i. Of the total of 95 accessions, 40 were hierarchically sampled in pairs from four populations in each of five regions (Table S1). The total amount of variation among these 40 accessions, θ̂P, among40 , was computed by taking the total average pairwise difference for all pairs of the 40 accessions, whereas the amount of variation within populations, θ̂P, withinpop , was calculated by taking the mean of the total average pairwise difference for the pairs of accessions in the 20 populations. The level of variation among geographical regions, θ̂P, amongreg , was computed as the difference between θ̂P, among40 and the mean of the total average pairwise differences for all pairs of accessions within regions. The level of variation among populations, θ̂P, amongpop , was calculated from the following expression: Genomic patterns of polymorphism Correlations were identified between levels of polymorphism and local gene density or degree of duplication (Figure 7). The local gene density was measured as open reading frames per centimorgan in windows of size greater than or equal to 1 Mb (using genetically mapped markers from the Lister and Dean recombinant inbred data as endpoints). The number of open reading frames (excluding pseudogenes and RNA genes) from the annotated reference sequence that fell between these window endpoints was counted, and length in centimorgans of each window was estimated from the genetic distance of the markers used as window endpoints. Correlations were quantified using Spearman's rank correlation, and the significance of the observed values was evaluated using 50,000 permutations that maintained the chromosomal order of all observations but that shuffled the relative positions of the two variables. (For each variable, the lists representing the consecutive values within each chromosome were concatenated in random order and direction to form a circle. The two circles were then randomly aligned with each other.) This is necessary to avoid inflated significance values due to autocorrelations along the chromosomes (of both variables). Using this procedure, the rank correlation between θ̂S in nonexon sequences and gene density is −0.27 (p = 0.0014), and the rank correlation between θ̂S in nonexon sequences and the negative log of the second-best BLAST e-value is 0.13 (p = 0.0018). To investigate the effect of population structure, all analyses (except those of population structure) were repeated with the outliers in Figure 4 removed (Cvi-0, Mr-0, and all but one randomly chosen member of each closely related group). All conclusions remain qualitatively the same. Supporting Information Dataset S1 All Data Used in the Paper (912 KB ZIP). Click here for additional data file. Figure S1 Hierarchical Clustering of Individuals Based on Pairwise Differences (12 MB EPS). Click here for additional data file. Table S1 The Population Samples Used in the Project (8 KB PDF). Click here for additional data file. Table S2 The Individual Accessions Used in the Project (8 KB PDF). Click here for additional data file.
            Bookmark
            • Record: found
            • Abstract: found
            • Article: not found

            Farmers and their languages: the first expansions.

            The largest movements and replacements of human populations since the end of the Ice Ages resulted from the geographically uneven rise of food production around the world. The first farming societies thereby gained great advantages over hunter-gatherer societies. But most of those resulting shifts of populations and languages are complex, controversial, or both. We discuss the main complications and specific examples involving 15 language families. Further progress will depend on interdisciplinary research that combines archaeology, crop and livestock studies, physical anthropology, genetics, and linguistics.
              Bookmark
              • Record: found
              • Abstract: found
              • Article: not found

              An Arabidopsis Example of Association Mapping in Structured Samples

              Introduction As sequencing and genotyping costs continue to decrease, association mapping (also known as linkage disequilibrium mapping) is emerging as a powerful, general tool for identifying alleles and loci responsible for natural variation. Although its application to human disease has received most attention [1], association mapping has tremendous potential in organisms ranging from Arabidopsis thaliana [2] to cattle [3]. Indeed, given how quickly sequencing and genotyping technology is developing, it will soon be feasible to carry out genome-wide association scans in almost any organism, including many that are not amenable to traditional mapping methods (e.g., because they cannot be crossed experimentally). A potentially serious obstacle to association mapping is confounding by population structure. It is well known that population structure may cause spurious correlations, leading to an elevated false-positive rate [4]. Whether this is likely to be a significant problem in human association studies is currently a subject of considerable debate [5–11]. Be that as it may, human association studies are typically designed as case-control studies, and the effects of confounding can thus be minimized by carefully matching cases and controls [7]. When association mapping is instead used to identify genes responsible for quantitative variation in a single population sample, there is every reason to believe that confounding will be a significant problem, especially if the trait varies geographically, as does height, skin color, or flowering time [7,12–14]. In an earlier paper, we demonstrated that confounding can be severe in a structured sample of 95 A. thaliana accessions for which genome-wide polymorphism data are available [13]. We also showed that the two best-known statistical approaches for dealing with this type of confounding, “genomic control” [15] and “structured association” [16], were not effective, the former because it lacks power in the presence of strong confounding, the latter because it did not adequately capture the complex pattern of relatedness in the sample. Although we were able to identify several previously known loci, the false-positive rate remained sufficiently elevated to make searching for new loci infeasible. In this paper, we evaluate a wider range of methods, including the mixed-model approach of Yu et al. [14] and the principal components approach of Price et al. [17]. We first investigate how these methods perform in terms of reducing the false-positive rate in genome-wide scans for a range of flowering-related phenotypes, then examine their power using data-perturbation simulations. We find that versions of the approach of Yu et al. generally perform best, but a comparison with the results of linkage mapping indicates that some confounding still remains, and, furthermore, that correcting for population structure also introduces false negatives. Nonetheless, the false-positive rate is reduced to levels that make it possibly to use our sample for association mapping, and although our scans have low power due to insufficient marker density and small sample size, we identify a number of plausible associations. Results Confounding by Population Structure Mean flowering time for each accession was measured under a variety of experimental conditions, at the University of Southern California (USC) and at the John Innes Centre (JIC). In addition, we measured expression levels of two key flowering time genes: the floral repressor FLC [19,20] and its promoter gene FRI [21]. The phenotypes used are summarized in Table 1 (and are available as Dataset S1). The genotypes were in the form of over 900 short sequenced fragments, distributed throughout the genome [22]. Depending on the analysis, each sequenced fragment was either broken up into constituent single nucleotide polymorphisms (SNPs), or treated as a multi-allelic locus (with alleles corresponding to unique haplotypes after removing singleton SNPs). The genotypes are available as Dataset S2. Table 1 Summary of Phenotypes Used The phenotypes were generally strongly correlated across accessions (Figure 1). As expected given our previous results, they were also correlated with the genome-wide pattern of relatedness (summarized using a tree in Figure 1), indicating that the null hypothesis of no association between genotype and phenotype is false in a genomic sense, and that the false-positive rate must therefore be inflated by spurious correlations [13]. Figure 1 Summary of the Data Illustrating Strong Positive Correlations between Phenotypes and between Phenotypes and Genome-Wide Relatedness The panel on the left gives the basic phenotypes used (see Table 1), with colors indicating relative values within each phenotype (white denotes missing data). The tree on the right shows a hierarchical clustering (UPGMA) of accessions based on their relative kinship (as measured by pairwise haplotype sharing). Colors for the accessions labels indicate geographic origins. And inflated it is. Figure 2 shows the cumulative distribution of p-values from association scans across all loci for all our phenotypes. The black curves correspond to naive tests of association without correction for population structure. Irrespective of phenotype, the distribution was strongly skewed towards significance, in agreement with our previous observations [13]. Figure 2 The Cumulative Distribution of p-Values in Genome-Wide Scans for All Phenotypes The different curves correspond to different approaches for correcting for population structure (described in Table 2). Without correction for population structure, all distributions are strongly skewed towards significance. The results shown here are for associations between phenotype and individual SNPs: results for haplotypes were very similar (see Figure S2). cdf, cumulative distribution function Table 2 Summary of Models Used Statistical Approaches for Reducing Confounding Pritchard et al. [16] introduced so-called “structured association” for reducing confounding due to population structure. The approach is based on first assigning individuals to subpopulations using a model-based Bayesian clustering algorithm, STRUCTURE, and then carrying out all analyses conditional on the inferred assignments. To the extent that it is possible to capture the relevant structure this way, the approach is eminently sensible, but as we have reported earlier, it does not appear to be sufficient for the sample used here [13]. The same was true in general for the present study, although performance varied greatly between phenotypes, and was in fact reasonable in a couple of cases. Yu et al. [14] recently introduced a mixed-model approach to control for population structure. The key to their approach is using a random effect to estimate the fraction of the phenotypic variation that can be explained by genome-wide correlations. The individual random deviations from the population mean are constrained by assuming that the (phenotypic) covariance between individuals is proportional to their relative relatedness (or kinship), which is estimated using genome-wide marker data. In addition to this random effect, Yu et al. used the population assignments produced by the STRUCTURE algorithm (the Q matrix) as a fixed effect in the model, as in structured association [12,14,16]. We applied the approach of Yu et al. [14] to analyze our data. We found that their full model (combining kinship and structured association) successfully reduced the false-positive rate almost to expected levels, indicating that confounding by population structure has largely been eliminated (Figure 2). The mixed model of Yu et al. includes two terms that are intended to accomplish the same thing. Population structure is modeled both as a fixed effect (using the Q matrix) and as a random effect (using the K matrix of pairwise kinship coefficients) [14]. While it is intuitively obvious that the latter captures features of the data that could not possibly be captured by the former (e.g., different levels of relatedness; see Discussion), it is by no means clear that the Q matrix should be required in addition to the kinship effect. Nonetheless, we found that it is (i.e., the K matrix was not sufficient; see Figure 2), as did Yu et al. [14]. We suspected that this might be an artifact of how kinship was estimated. The formula used by Yu et al. [14] is based on classical work on estimating kinship, defined as the fraction of the genome shared identical by descent, using marker data [23–25]. A distinction is made in this work between identity by descent and identity in state, where the latter is defined, loosely speaking, as the amount of allele sharing expected between “unrelated” individuals. This would seem to make little sense in the context of association mapping: there are no unrelated individuals [26], and identity in state implies identity by descent when using markers with a very low mutation rate (such as SNPs). It seemed to us that kinship should simply be estimated as the fraction of shared alleles. We tried this, and found that, indeed, with this new measure of kinship, the false-positive rate could often (but not always) be reduced almost as well by using kinship alone. It turns out that the best results were obtained not by considering the fraction of shared SNP alleles, but by considering the fraction of shared fragment haplotypes (recall that our data are in the form of short sequence fragments), presumably because these capture the underlying structure better. As illustrated in Figure 2, correcting for population structure using our simpler kinship estimate often worked as well as using the full model of Yu et al. [14] in terms of reducing the false-positive rate. Estimating population structure using STRUCTURE thus appears to be unnecessary in many cases. Yu et al. have independently reached similar conclusions (E. S. Buckler, G. Pressoir, J. Yu, Z. Zhang, unpublished data). However, for some phenotypes the false-positive rate was still clearly inflated, and we therefore also analyzed the data using both the Q matrix and our new kinship estimate as cofactors (see Table 2). The results from this model were very similar to those obtained using the full model of Yu et al. One reason for investigating methods that do not require the use of the Q matrix is that the STRUCTURE algorithm is computationally intensive, and may be impractical on large datasets. With this in mind, Price et al. [17] suggested that a principal components analysis (PCA) be used to summarize genome-wide patterns of relatedness instead of the Q matrix. We tried this approach as well, and found that although it performed better than using Q alone, it did not perform as well as our modified kinship matrix, nor the approaches that combine kinship estimates with Q. Finally, we tried a mixed-model approach that combined the PCA assignments (the P matrix) with kinship estimates. This approach performed similarly to mixed-models using Q, suggesting that it may be possible to replace the computationally intensive STRUCTURE algorithm with a simple PCA in the mixed-model approach. The Effect on Power The relative power of the various approaches was compared using data-perturbation simulations. Large numbers of perturbed datasets were generated by assigning a phenotypic effect (expressed as a deviation from the population mean) to randomly chosen SNPs (one per simulated dataset). These datasets were then analyzed using the various approaches, and, for each approach, we recorded the fraction of perturbed datasets for which the p-value for the simulated quantitative trait nucleotide (QTN) was in the 5th percentile (estimated from the genome-wide SNPs). Figure 3A shows the resulting power estimates as a function of the fraction of the phenotypic variation explained by the QTN. In general, approaches that reduced confounding more effectively had better power than approaches that did not, presumably because the confounding noise tended to overwhelm the signal of the true association in the latter cases. Figure 3 Power of the Different Methods to Detect a QTN The power of the different methods to detect a QTN using a 5% significance level (based on the observed distribution of p-values), as a function of the fraction of the phenotypic variation attributable to the QTN. (A) Power averaged across all simulated QTN. Methods that reduce confounding more effectively have better power. (B) Power calculated separately for QTN that showed strong correlation with population structure (“high PS,” arbitrarily defined as simulations where more than 50% of the phenotypic variation could be explained by Q (note that “P” performs better than “Q” largely because of this definition) and those that did not (“low PS”). All methods are relatively powerless to detect QTN that are strongly correlated with population structure. Results for haplotypes were similar (Figure S3). While these results might seem to indicate that it is possible to reduce the false-positive rate without paying a price in terms of an increased false-negative rate (i.e., decreased power), this is by no means the case. All the methods used here attempt to reduce confounding by identifying and eliminating genotype–phenotype associations that are in effect genome-wide because they reflect underlying population structure, leaving only those associations that cannot be accounted for by this structure. However, not all the associations thus eliminated will be false; indeed, in a genome-wide scan some of them will almost certainly be true [13]. To give an extreme example, if the causal polymorphisms are perfectly correlated with the underlying population structure (e.g., because of selection), it will not be possible to distinguish between true and false positives statistically, and any attempt to remove the latter will remove the former. This effect could also be seen in our power study when we separated the simulated QTN into those that happened to be strongly associated with population structure and those that did not. As shown in Figure 3B, power to detect the former was considerably poorer than power to detect the latter. Residual Confounding Since several of the methods discussed above produced p-value distributions that looked reasonable, we proceeded to analyze the results of our genome-wide scan using the methods that produced reasonably flat p-value distributions. For these analyses, we decided to focus on treating our sequence data as short haplotypes, partly because we expect power to be higher when the information inherent in linkage disequilibrium between SNPs is utilized [27], partly to reduce the problem of multiple comparisons (which also reduces power). Table 3 lists the number of loci that are significant using a nominal 0.1% level for each phenotype and each method (for detailed output of the scans, see Dataset S3 and Figures S4–S8). The number of significant loci under the naive Kruskal-Wallis approach is included for comparison. It is clear that there were many more significant associations than could be explained by chance alone. For several reasons, we believe that many, if not most, of these are spurious correlations due to residual confounding rather than true associations. Table 3 Number of Significant Loci at a Nominal 0.1% Level The first reason for skepticism is general pessimism about the power of the study. In other words, we would not have expected to find such a high number of true positives. This argument relies partly on a priori notions about the genetic architecture of the trait (in particular about genetic heterogeneity), and partly on demonstrable limitations of the study. Our study is clearly underpowered in two different ways [13]. First, the sample is too small to detect anything but very major quantitative trait loci (QTL) even if we assume that the causal polymorphisms themselves have been included in the sequencing. Figure 4 shows the expected power to detect such loci (at a nominal significance level of 0.1%) as estimated from our data perturbation simulations. To put these results in perspective, note that the major loss-of-function alleles at the vernalization-response locus FRI explained between 14% and 24% of the variation for flowering time in the absence of a vernalization treatment (Table 1). Expected power to detect a locus like FRI, by most standards a major QTL, would thus be 40%–80% in our sample. Second, the marker density is insufficient given the extent of linkage disequilibrium in the sample. It is difficult to quantify the resulting loss in power because linkage disequilibrium is so variable, but an optimistic estimate is that 20% of the genome is effectively being screened [13]. Putting these results together, it is clear that we should not expect to detect more than a minor fraction of major genes. If we believe that all the associations in Table 3 are real, we must thus also believe that there are at least an order of magnitude more major loci that have not been detected. As we shall see below, such a belief is contradicted by data. Figure 4 Power to Detect Associations Using a Nominal 0.1% Significance Level The QTN is assumed to lie within a sequenced haplotype, but power is nonetheless poor even for major QTL. The second reason for skepticism is that the associations observed strongly suggest residual confounding. We have previously noted that many of the strongest associations seem to be due to allele sharing among very late-flowering accessions from Finland and northern Sweden [13,27]. These associations have largely been removed by the methods used here, but probably not completely. Many of the remaining significant associations in this study involved small clusters of these late-flowering accessions with a few others thrown in. It is likely that the correlation with population structure for these clusters was not sufficiently strong for the various methods to eliminate them. The third and strongest reason for skepticism is direct comparison with linkage mapping results [28]. Figure 5 shows a comparison between the genome-wide association scans for flowering time after four weeks vernalization at the John Innes Centre, and the linkage mapping experiments undertaken under similar conditions (Figures S9 and S10 show the results of different crosses, undertaken at zero and eight weeks vernalization, respectively). The existence of a (true) association at a particular locus implies that a QTL overlapping this locus should exist in crosses involving accessions that carry different alleles for the locus in question (the reverse is not true; the existence of a QTL in cross does not imply the existence of a population association because the QTL could be due to a rare allele with negligible effect on the population variation). Associations that ought to correspond to a QTL in a particular cross have been labeled accordingly in Figure 5. The overall lack of correspondence between the association and linkage mapping results is obvious. One of the surprising results of the linkage mapping experiments was that, across six crosses, almost all of the variation was due to QTL in three genomic regions: the top of Chromosome 4 (certainly involving FRI, and possibly also other loci), the top of Chromosome 5 (with FLC as a strong candidate for at least part of the effect), and the bottom of Chromosome 1 (certainly involving multiple QTL). There was no evidence of any QTL on Chromosomes 2 or 3. Nominally significant associations, in contrast, although far from uniformly distributed, were found across the genome, indicating that most are likely to be spurious even after confounding has been reduced statistically. Note that this argument relies on the linkage mapping experiments having sufficient power; however, it seems unlikely that a QTL would be detectable as an association in a heterogeneous sample of 96 accessions and not be detectable in an F2 cross using twice the sample size. Figure 5 Comparison of Association and Linkage Mapping Results for the Phenotype JIC4W The position of candidate genes with higher marker density are highlighted in yellow. (A) The QTL LOD scores for two crosses: Col-0 × Lov-1 (pink curve) and Col-0 × Edi-0 (cyan curve). The dashed grey line marks a genome-wide permutation significance level of 5%. (B) Negative log p-values from a genome-wide scan using the Q + K model. (C) Negative log p-values from a genome-wide scan using a naive Kruskal-Wallis approach. (B and C) The dashed grey line marks a nominal significance level of 0.1%, and the colored stars denote associations that should correspond to QTL in the cross displayed using the same color (because the appropriate alleles are segregating; see text). Plausible Associations Given that confounding remains, it was clear that the best we could hope to accomplish was compiling a list of plausible associations worthy of further study. How this should be done is a highly subjective matter. We decided to use consistency across association methods as our main criterion. Specifically, we only consider loci significant at the 0.1% level across all five methods in Table 3. This procedure resulted in a short list of seven loci, shown in Table 4. Table 4 The Most Promising Associations Several features of Table 4 support the notion that these associations are good candidates for being true associations. First, FRI, the one locus known a priori to be involved in flowering time variation, generally shows the pattern of association we would expect it to show. Particularly noteworthy was the incredibly strong association with FRI expression levels, almost completely due to the cis-regulatory mutation present in Ler and 12 other accessions. Allelic variation at FRI was also associated with variation in FLC expression, in agreement with what is known about the function of these genes [29]. Second, leaving FRI aside, two out six of the significant associations occurred in our candidate genes. Since 18 out of 889 marker loci were located in candidate genes, this is sixteen times more than would be expected by chance. Third, one of the four associations not in a gene identified a priori as a candidate, was in a gene identified a posteriori as a good candidate for being involved in variation for the phenotype with which it was associated (CLF and response to differences in day length [30,31]). Finally, note that none of the associations in Table 4 are contradicted by the crosses. The phenotypes are different except for AT4G02880, which is significant for JIC0W (Figure S9); this association coincides with the QTL overlapping FRI. Published Associations Several recent papers have reported associations between flowering time and genotypes at candidate loci. As part of a study involving field measurement of flowering times, Olsen et al. [32] and Caicedo et al. [33] found associations with CRY2 and FLC, respectively. They claimed to have controlled for population structure by using 79 AFLP markers to identify an “unstratified” population in which to test for associations. This claim was directly contradicted by the observation that the significant associations reported in these papers showed clear geographic patterns, however, and a simple analysis of the distribution of p-values across the AFLP markers reveals a skew similar to the ones we have discussed in this paper (Figure S12). A separate study by Balasubramanian et al. [34] reported associations with PHYC. Although the p-value distribution was strongly skewed in this study as well (see their Figure 4), the claim of association was rather more convincing because it was supported by several independent sources of evidence (including linkage mapping results). We decided to investigate these published polymorphisms in our sample, where the genome-wide data would make it possible to evaluate the significance of any associations. Using the various mixed-model approaches, PHYC was marginally significant for ±V (LD) and VERN (p-values ranged from 0.01 to 0.07 depending on the method used, see Table S1), while FLC showed a tendency towards significance (p-values ranged from 0.03 to 0.13) for SD, SD/LD (V), and FLC. CRY2 showed no sign of being associated with any phenotype. Clearly, none of these polymorphisms would have been picked up in a genome-wide scan. Naive analyses using standard linear methods without correction for population structure (as in the original papers) often identified what appeared to be highly significant associations. Furthermore, adding FRI genotype and latitude as terms to the model (as was done in some of the original papers) often had a dramatic effect on the results. For example, CRY2 was associated (at a nominal 5% significance level) with LDV, SDV, and JIC8W, but became associated with 11 out of 16 phenotypes if FRI genotype was added to the model. The significance of a particular term was often strongly dependent on which other terms were included in the model (Tables S2–S4). This is probably to be expected; in the presence of confounding population structure, including these kinds of terms is essentially a form of partial genomic control, no different from including the Q matrix [14], the P matrix [17], or selected control SNPs [11]. The nominal p-values are of course affected by confounding population structure as discussed earlier in this paper, and we therefore attempted to assess their significance by comparing them to the genome-wide distribution of p-values, obtained by testing either all SNPs or short haplotypes in our data using the same model. While this might appear to be a robust procedure guaranteed to yield valid p-values, it is in fact anti-conservative because the candidate polymorphisms were ascertained to be informative about local haplotype structure. For example, the SNPs used to “tag” the three CRY2 “haplogroups” [32] are not random SNPs that can be compared to randomly chosen genome-wide SNPs; they are SNPs that were selected precisely because they identify highly diverged haplotypes. It is reasonable to assume that such SNPs are also more strongly associated with population structure than are random SNPs. By comparing a tag SNP association to the genome-wide distribution based on random SNPs, we are thus likely to exaggerate its significance. Comparing to haplotype markers should help, but there is no guarantee. Even so, we found that most of the nominally significant associations disappear when the genome-wide distribution was used as a control. Considering the number of phenotypes tested, CRY2 showed no significant associations when compared to the genome-wide distribution of haplotype tests. There were more significant associations than could be expected when genome-wide SNPs were used as a control, but there are several reasons to be skeptical of these: first, the CRY2 polymorphism is in fact a haplotype polymorphism with three alleles, and should thus be compared to the haplotype distribution; second, most of the associations disappear if a CRY2 by FRI interaction term is added to the model; and third, at least one of the associations is directly contradicted by our linkage mapping results (different CRY2 haplotypes segregate in the Col-0 × Ull2–5 cross shown in Figure S10, yet there is no QTL at CRY2 despite an association for JIC8W). The FLC polymorphism identified by Caicedo et al. [33] showed a weak but consistent association with the response to differences in day length, SD/LD (V). A similar association was also obtained using mixed models. The FLC polymorphism was only segregating in one of our crosses, Col-0 × Edi-0, the only one for which a strong QTL effect overlapping FLC was not seen (Figure 5). Note that this does not contradict a SD/LD (V) association since the phenotype measured in the cross was JIC4W. PHYC, finally, was not significantly associated with any phenotype. However, in agreement with previous results [34], PHYC appeared to interact with latitude. The PHYC by Latitude interaction term was very significant for ±V (SD), and weakly significant for ±V (LD) and VERN. The latter two were also weakly significant in the mixed-model analyses. Although the PHYC polymorphism was segregating in all our crosses, we never observed a QTL overlapping it. Again, the phenotypes measured in the crosses are very different from the ones that appear to be associated with PHYC. Discussion There is currently great interest in applying association mapping to a wide range of organisms, many of which (e.g., maize [12] and dog [35]) have heavily structured populations. Our study provides a dramatic example of the difficulties that may be encountered. Although our example may ultimately turn out to have been extreme, we would caution against premature optimism. Indeed, there should be no reason ever to assume anything about the importance of population structure in a particular study; it is straightforward to determine the distribution of test statistics empirically by calculating them across large numbers of loci. In genome-wide scans, large numbers of loci will by definition always be available, and for candidate locus studies, our results simply illustrate the need for an appropriate genomic control. Any association result not accompanied by a demonstration that the reported significance values are actually meaningful should be treated with suspicion (unless supported by strong experimental evidence). We took advantage of our genome-wide data to evaluate published claims of association in A. thaliana. We found no evidence for association between the CRY2 polymorphism reported by Olsen et al. [32] and any of our phenotypes, while the FLC polymorphism reported by Caicedo et al. [33] is at best weakly associated with the flowering response to differences in day length after vernalization. In contrast, the PHYC polymorphism reported by Balasubramanian et al. [34] does appear to show a significant interaction with latitude in determining vernalization response, in broad agreement with the original findings. While it is of course not possible to disprove a published association using a different sample and different phenotypic measurements, it is notable that neither Olsen et al. [32] nor Caicedo et al. [33] employed an appropriate genomic control, while Balasubramanian et al. [34] based their claim on a ranking of p-values (from a small genomic control) and, more importantly, several independent sources of evidence. A number of statistical approaches for reducing confounding by population structure have been proposed. Methods, like “genomic control” [15], which simply rescale p-values without changing the rankings of loci, are not likely to be useful in genome-wide scans where the existence of true positives is not in doubt. Most biologists will simply rank the associations by nominal significance and proceed to test the most significant. Methods that attempt to separate true from spurious associations, like the ones used in this study, are of much greater interest. Our results demonstrate that such methods can be very effective at reducing (not eliminating) confounding. It is clear that this is a hard problem, and more theoretical work is certainly warranted. Comprehensive simulation studies (with known null distributions) would be required to determine how the different methods perform in terms of the false-positive rate. The mixed-model approach [14,36] seems particularly promising, and further research would probably be worthwhile. General principles for how kinship (the K matrix) should be estimated might be attainable, whereas the utility of estimators of population structure are likely to vary greatly from case to case. The underlying structure in our sample appears to be some form of isolation-by-distance [22], and both STRUCTURE [18] and PCA [17] appear to capture the relevant aspects of this structure reasonably well (Table 1). The first two principal components from the PCA are in fact strongly correlated with longitude and latitude (Figure S11). At the same time, it is important to recognize that there are limits to what can be achieved using statistics. As noted above, any method that effectively removes confounding will also effectively remove true positives that are strongly correlated with population structure. Thus, while reducing confounding makes the common, widely distributed alleles of FRI stand out more clearly in our study, it also eliminates many of the best candidates for the extreme late-flowering phenotype characteristic of the northern accessions. Because these accessions are so distinct on a genome-wide level, any late-flowering allele shared by all or most of them would (and should) be removed by a statistical method that reduces confounding. There has to be segregation for mapping to work. Interestingly, FLC, a central regulator of flowering, may be an example of this. Allelic variation at FLC has been shown to affect flowering [37,38], and five of our F2 crosses revealed a strong flowering QTL centered on it [28]. FLC expression is strongly correlated with flowering time (Table 1 and [39]). In our naive analysis (i.e., without correcting for population structure), FLC is one of the strongest associations for many phenotypes; however, because the significance is due to allele sharing among a large subset of northern accessions, it is down-weighted when population structure is taken into account (this can be seen in Figure 5, for example). It should be noted that the strong association we find at FLC (whether causative or spurious) does not correspond to the association previously reported by Caicedo et al. [33]. As discussed above, the polymorphism studied by these authors is also associated with several of our phenotypes (Tables S2–S4), but the associations are much weaker. In conclusion, we should emphasize that although this paper has focused on the difficulties caused by confounding, we do not mean to say that these studies are impossible, merely that caution is needed. The statistical methods we used look promising, and we did identify several interesting associations. It must be remembered that, in a small and heavily structured sample such as the one used here, we should really only expect to be able to find polymorphisms with a major effect on the phenotypic variation that segregate widely across the geographic distribution of the sample. FRI is an example, and we do indeed find it. Of the new associations we have identified (Table 4), the most promising is probably the common polymorphism at CLF, which may be a major factor in determining the differential flowering response to day length. However, the large fraction of the phenotypic variation that is due to the very late accessions probably cannot be dissected further using this sample. There is every reason to believe that genome-wide association studies involving adequate marker densities and larger samples will be very fruitful. Materials and Methods Plant materials. The accessions used have been described in previous papers [13,22]. Genotypes. We used the resequencing data described in reference [13], plus approximately 100 fragments in and around several flowering time–related genes: FLM [40,41], FVE [42], FPA [43], VRN1 [44], FRI [21], LD [45], FCA [46], VRN2 [47], FLC [19,20], and FY [48]. Note that since the accessions are inbred, these data are in the form of short haplotypes. Furthermore, linkage disequilibrium is extensive, and there was no evidence of recombination in about three quarters of the fragments [22]. Given this, some association analyses were carried out treating the fragments as multi-allelic loci, with each haplotype corresponding to an allele. To avoid too many rare haplotypes, all singleton polymorphisms were ignored. Remaining rare haplotypes (those with frequency less than 5%) were pooled. The CRY2 haplogroups of Olsen et al. [32] were typed in our sample by resequencing two fragments in the CRY2 gene. Similarly, the FLC haplotypes of Caicedo et al. [33] were identified based on their SNP 2553 and our resequencing data. The PHYC genotyping was done by S. Balasubramanian in D. Weigel's laboratory using published methods [34]. Phenotypes. Plants were grown under either long-day (16 h light/8 h dark) or short-day (8 h light/16 h dark) conditions, using a randomized design (except with respect to growth chambers). Seeds were sown on soil and stratified for 2–3 d at 4 °C to synchronize germination. At USC, experiments were carried out in Conviron MTPS72 growth chambers programmed to have a gradual transition between light and dark conditions. Plants were grown at a constant temperature of 18 °C. The vernalization treatment was done at 4 °C on 3–4-d-old seedlings, using the same light conditions as for the unvernalized plants. Conditions at JIC were similar, but plants were grown in a custom-built walk-in controlled environment room using a temperature of 23 °C, and vernalization was done at 5 °C under short-day (8h light) conditions in a Sanyo-Gallenkamp CE with cool white fluorescent lamps only. There were thus clear differences in both temperature and light quality between the USC and JIC experiments. Flowering times were measured in days from germination to the first opening of flowers (USC) or to plant height reached at 5 cm (JIC). All measurements were replicated (using six plants at USC and ten plants at JIC). The variance across replicates was trivial compared to the variance across accessions; including it in the analyses did not seem to affect results and we therefore carried out all analyses using the accession means. The experiments were continued until either all individuals had flowered or the rate of new individuals starting to flower was deemed to have reached zero. When an experiment was stopped before all accessions had flowered, the remaining individuals were assigned the stopping date as phenotype (the resulting phenotypic distributions can be seen in Figure S1). The precise value of the phenotype assigned to these nonflowering accessions did not seem to affect the analyses (unpublished data). As indicated in Table 1, we also considered several ratios of phenotypes in order to look for responses to particular treatments. The rate of response to vernalization was studied using the 0-, 2-, 4-, and 8-wk vernalization treatments from JIC. We did this by first an exponential decay model, FT(t) = ae−bt , where FT(t) is the flowering time after t weeks of vernalization. The decay slope b was then used as the phenotype in association mapping. FRI and FLC expression was measured as described in Shindo et al. [29]. All phenotypes were normalized before analysis. Association mapping. Our basic analyses, including our implementation of “structured association” and estimation of population structure using STRUCTURE [18] have been described elsewhere [13,22]. Our mixed-model approach follows that of Yu et al. [14] closely. The vector of phenotypes, y, is modeled as where X contains the genotypes, α is vector of allele effects to be estimated, Q contains the population assignments by STRUCTURE, β is vector of subpopulation effects, I is an identity matrix, u are the random deviates due to genome-wide relatedness, and e are random deviates due to noise. Crucially, the phenotypic covariance matrix is assumed to have the form where K is the matrix of kinship coefficients, is the genetic variance attributable to genome-wide effects, and is the residual variance. In addition to the full model described by Equation 1, we considered models without the population structure term, and without the kinship term (this is “structured association”). We also varied how we estimated the kinship coefficients. Results are presented for two different choices: the K used by Yu et al. [14], and our own K*. The former contains the relative kinship coefficients of Ritland [24] estimated using SPAGeDi [49] from about 12,000 non-singleton SNPs identified by Nordborg et al. [22]. In this approach, the kinship coefficient between individuals i and j is defined as Kij = (Qij − Qm )/(1 − Qm ), where Qij is the probability of identity in state for random loci from i and j, and Qm is the average probability of identity by state for loci from random individuals from the sample. With this definition, a negative relative kinship coefficient means that the particular pair of accessions is less related than random individuals. In the analysis, all negative values were set to zero [14]. As an alternative to this approach, we introduced K*, which contains kinship coefficients defined simply as the proportion of shared haplotypes (with singletons removed) for each pair of individuals. All analyses were done using standard methods as implemented in SAS using PROC MIXED. The significance of the fixed effects was obtained from F-tests with the denominator degrees of freedom obtained using the Satterthwaite method . We tested the principal components method using a modified version of EIGENSTRAT [17]. Since our phenotypes are quantitative rather than binary, we replaced the χ2 statistic with Spearman's rank correlation. Finally, we tried a mixed-model approach where we simply replaced the Q matrix produced by STRUCTURE with a P matrix containing the eight top principal components from EIGENSTRAT. Data perturbation simulations. A simulation scheme similar to that of Yu et al. [14] was used. The idea was to perturb the existing phenotypes (we used LDV) by adding a constant additive effect to a randomly chosen causal SNP, thus maintaining the complicated correlation structure of the data. Causal SNPs were randomly chosen from SNPs with minor allele frequency in the 20%–40% range. A fixed genetic effect was added to each causal SNP. Fixed effects ranging from 0.2 to 1.4 times the standard deviation of the phenotype were used. The percentage of the total phenotypic variation explained by each causal SNP (which depends on the frequency of the SNP as well as its effect) was calculated using a standard linear regression of phenotype on SNP. Linkage mapping in F2 crosses. QTL mapping for flowering time was carried out in six F2 populations generated by crossing Col-0 to six different accessions chosen because of their differential response in our association study. Four of these crosses have previously reported by Shindo et al. [28], whereas two (Col-0 × Knox-18 and Col-0 × RRS-10) are reported for the first time here. For details about the crosses, please see [28]. Analysis was done using composite interval mapping as implemented in QTL Cartographer [50]. Supporting Information Dataset S1 Phenotypes Used in the Analysis (5 KB CSV) Click here for additional data file. Dataset S2 All Sequence Alignments Not Previously Published [13], Plus Tables of Haplotype and SNP Genotypes Used in the Analyses (1.0 MB ZIP) Click here for additional data file. Dataset S3 Output of Genome-Wide Scan for All Methods and Phenotypes (8.0 MB ZIP) Click here for additional data file. Figure S1 Histograms Showing the Distributions of the Various Phenotypes (10 KB PDF) Click here for additional data file. Figure S2 Cumulative Distribution of p-Values from Haplotype-Based Tests (1.2 MB PDF) Click here for additional data file. Figure S3 Power Comparison for Haplotype-Based Tests (86 KB PDF) Click here for additional data file. Figure S4 Genome-Wide Association Scans Using the K* Model Peak heights represent negative log p-values from tests of association between genotype and phenotype. The dashed grey line marks a nominal significance level of 0.1%. The position of the candidate genes are highlighted. Nominally significant associations occur throughout the genome, and often correlated across phenotypes. (132 KB PDF) Click here for additional data file. Figure S5 Genome-Wide Association Scans Using the Q + K Model (131 KB PDF) Click here for additional data file. Figure S6 Genome-Wide Association Scans Using the Q + K* Model (133 KB PDF) Click here for additional data file. Figure S7 Genome-Wide Association Scans Using the P + K Model (126 KB PDF) Click here for additional data file. Figure S8 Genome-Wide Association Scans Using the P + K* Model (131 KB PDF) Click here for additional data file. Figure S9 Comparison of Association and Linkage Mapping Results for the Phenotype JIC0W The position of certain candidate genes are highlighted in yellow. (A) The QTL LOD scores for the cross Col-0 × Kno-18 (blue curve). The dashed grey line marks a genome-wide permutation significance level of 5%. (B ) Negative log p-values from a genome-wide scan using the Q + K model. (C) Negative log p-values from a genome-wide scan using a naive Kruskal-Wallis approach. (B and C) The dashed grey line marks a nominal significance level of 0.1%, and the colored stars denote associations that should correspond to QTL in the cross. (64 KB PDF) Click here for additional data file. Figure S10 Comparison of Association and Linkage Mapping Results for the Phenotype JIC8W The position of certain candidate genes are highlighted in yellow. (A) The QTL LOD scores for three crosses: Col-0 × RRS-10 (black curve), Col-0 × Var2–6 (red curve), and Col-0 × Ull2–5 (green curve). The dashed grey line marks a genome-wide permutation significance level of 5%. (B) Negative log p-values from a genome-wide scan using the Q + K model. (C) Negative log p-values from a genome-wide scan using a naive Kruskal-Wallis approach. In (B) and (C), the dashed grey line marks a nominal significance level of 0.001, and the colored stars denote associations that should correspond to QTL in the cross displayed using the same color. (71 KB PDF) Click here for additional data file. Figure S11 The Top Two Principal Components from EIGENSTRAT Superimposed on Geographic Origins (54 KB PDF) Click here for additional data file. Figure S12 Cumulative Densities of p-Values for Association between the AFLP Markers and Phenotypes Used by Olsen et al. [32] and Caicedo et al. [33] (11 KB PDF) Click here for additional data file. Table S1 Mixed-Model Analysis of Previously Published Candidate Polymorphisms in CRY2 [32], FLC [33], and PHYC [34] (17 KB PDF) Click here for additional data file. Table S2 Basic Linear Model Analysis of Previously Published Candidate Polymorphisms in CRY2 [32], FLC [33], and PHYC [34] (25 KB PDF) Click here for additional data file. Table S3 Linear Model Analysis (Including a Locus by Latitude Interaction Term) of Previously Published Candidate Polymorphisms in CRY2 [32], FLC [33], and PHYC [34] (18 KB PDF) Click here for additional data file. Table S4 Linear Model Analysis (Including Both Locus by Latitude and Locus by FRI Interaction Terms) of Previously Published Candidate Polymorphisms in CRY2 [32], FLC [33], and PHYC [34] (18 KB PDF) Click here for additional data file.
                Bookmark

                Author and article information

                Contributors
                Role: Editor
                Journal
                PLoS Genet
                plos
                plosgen
                PLoS Genetics
                Public Library of Science (San Francisco, USA )
                1553-7390
                1553-7404
                May 2008
                May 2008
                16 May 2008
                : 4
                : 5
                : e1000075
                Affiliations
                [1 ]Institut National Polytechnique de Grenoble, Grenoble, France
                [2 ]Centre National de la Recherche Scientifique, TIMC-IMAG, Faculty of Medicine, La Tronche, France
                [3 ]Department of Human Genetics, Center for Computational Medicine and Biology, University of Michigan, Ann Arbor, Michigan, United States of America
                [4 ]The Life Sciences Institute, University of Michigan, Ann Arbor, Michigan, United States of America
                University of Aarhus, Denmark
                Author notes

                Analyzed the data: OF MB MJ NR. Contributed reagents/materials/analysis tools: OF MB MJ NR. Wrote the paper: OF NR. Designed the study: OF.

                Article
                07-PLGE-RA-1016R2
                10.1371/journal.pgen.1000075
                2364639
                18483550
                80a0a2de-af94-46e5-9962-11399eacc9fc
                François et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
                History
                : 5 November 2007
                : 17 April 2008
                Page count
                Pages: 15
                Categories
                Research Article
                Ecology/Evolutionary Ecology
                Ecology/Spatial and Landscape Ecology
                Genetics and Genomics
                Genetics and Genomics/Population Genetics
                Plant Biology/Plant Genomes and Evolution

                Genetics
                Genetics

                Comments

                Comment on this article