11
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      Exploiting the Genomic Diversity of Rice ( Oryza sativa L.): SNP-Typing in 11 Early-Backcross Introgression-Breeding Populations

      research-article

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          This study demonstrates genotyping-by-sequencing-based single-nucleotide polymorphism (SNP)-typing in 11 early-backcross introgression populations of rice (at BC 1F 5), comprising a set of 564 diverse introgression lines and 12 parents. Sequencing using 10 Ion Proton runs generated a total of ∼943.4 million raw reads, out of which ∼881.6 million reads remained after trimming for low-quality bases. After alignment, 794,297 polymorphic SNPs were identified, and filtering resulted in LMD50 SNPs (low missing data, with each SNP, genotyped in at least 50% of the samples) for each sub-population. Every data point was supported by actual sequencing data without any imputation, eliminating imputation-induced errors in SNP calling. Genotyping substantiated the impacts of novel breeding strategy revealing: (a) the donor introgression patterns in ILs were characteristic with variable introgression frequency in different genomic regions, attributed mainly to stringent selection under abiotic stress and (b) considerably lower heterozygosity was observed in ILs. Functional annotation revealed 426 non-synonymous deleterious SNPs present in 102 loci with a range of 1–4 SNPs per locus and 120 novel SNPs. SNP-typing this diversity panel will further assist in the development of markers supporting genomic applications in molecular breeding programs.

          Related collections

          Most cited references29

          • Record: found
          • Abstract: found
          • Article: not found

          The Pattern of Polymorphism in Arabidopsis thaliana

          Introduction The field of population genetics has always been heavily influenced by mathematical models. Ever since molecular polymorphism data started to become available, in the form of allozymes [1] or DNA sequences [2], population geneticists have searched for footprints of selection by comparing the patterns of polymorphism in particular genes with the pattern expected under standard neutral models [3,4]. Considerable intellectual effort has gone into estimating model parameters such as the mutation rate θ, the recombination rate ρ, and the effective population size, Ne [3,5]. However, because of the limited availability of data, it has been difficult to determine whether the underlying models are appropriate. For example, demographic factors such as population structure and growth can cause the genome-wide pattern of variation to deviate from standard neutral models in ways that mimic selection [4,6]. Thus, without knowing whether a standard neutral model describes the pattern of variation in most of the genome, it is difficult to conclude that a particular gene has been under selection. With the advent of high-throughput genotyping and sequencing, sufficient data for the critical appraisal of standard models are starting to become available, especially in humans [7–9]. Here we report our findings from a systematic survey of genomic DNA sequence polymorphism in Arabidopsis thaliana, one of the first in any organism. Our goal was to investigate the pattern of polymorphism in a large sample of individuals, using sufficiently densely spaced loci to obtain insight into the genome-wide haplotype structure of the species. The scale of our study allows us to describe the pattern of polymorphism with unprecedented accuracy. We begin by describing how variation is distributed, with respect to space (i.e., population structure) as well as with respect to haplotypes (i.e., linkage disequilibrium [LD]). Our set of 96 individuals contained hierarchical population samples in addition to a worldwide collection of stock center accessions (Tables S1 and S2): because of this and the large number of polymorphisms, we are able to answer a number of questions about population structure that previous studies have not been able to address. In the second part of the paper, we compare the pattern of variation to predictions made by standard population genetics models. The number of loci sequenced is sufficient to investigate the distribution of important summary statistics across the genome rather than simply looking at averages (as is usually done). Results/Discussion Sequencing A total of 876 reliable alignments, representing 0.48 Mbp of the genome (or a total over all individuals of ~44 Mbp) was generated. The average sequence length is 583 bp; the average sample size across alignments is 89. Based on the A. thaliana genome annotation, the composition of our data, which includes more than 17,000 single nucleotide polymorphisms (SNPs) and insertion/deletion polymorphisms, is 15% intergenic, 55% exon, 22% intron, 4% UTR, and 5% pseudogene (see Materials and Methods). The majority of fragments, 67%, contain both coding and noncoding sequence. Population Structure Overall levels of polymorphism Our estimates of the level of polymorphism are broadly comparable to what has previously been found in A. thaliana and other species, both in terms of overall levels of polymorphism, and in the degree of constraint on different kinds of sites (Figure 1; cf. [3]). The highly selfing A. thaliana does not have unusually low levels of polymorphism: the observed values are somewhat lower than for Drosophila melanogaster, and are considerably higher than for humans. A standard way of summarizing the geographical distribution of this variation is through the statistic FST, which, loosely speaking, measures the fraction of the observed genetic variation that is due to population structure [10]. Our sample contains 40 individuals that were hierarchically sampled in pairs from four populations in each of five regions (Table S1). A hierarchical analysis of variation in these individuals reveals that 33% of the global variation is segregating among individuals within populations, 35% is segregating among local populations within regions, and 26% is segregating among regions. Only 6% of the variation in the global sample is not captured by these 40 individuals. Even though only two individuals were sampled per population, and even though our estimates of within-population variance are upwardly biased (individuals were prescreened to avoid sequencing identical individuals; see Materials and Methods), our data clearly show that individual populations harbor much of the variation present species-wide. At the same time, there is strong population structure. Global geographic structure Studies of variation in A. thaliana have typically not found any correlation between the genotype and geographic origin of accessions. This has been attributed to a recent expansion of the species, perhaps in combination with human disturbance. However, early studies had little power to detect population structure, and a more recent, larger survey revealed weak isolation by distance [11]. Our study has several orders of magnitude more markers than any previous study known to us, and we find clear global population structure. We used a model-based clustering algorithm, implemented in the computer program Structure, to cluster our accessions on the basis of genotype [12]. Loosely speaking, the algorithm attempts to identify a predetermined number of clusters, K, that have distinctive allele frequencies, and assigns portions of individual genomes to these clusters. A genome can have membership in several clusters, and the algorithm reports the probability distribution of the assignment of each section of the genome. We analyzed the data by successively increasing K from two to eight (Figure 2). For K = 2, we see an East–West gradient, potentially attributable to post-glaciation colonization routes [11]. When we increaseK to three, all accessions from northern Sweden and Finland are assigned to a single cluster together with (to varying degrees) accessions from Eastern Europe, Russia, and Central Asia. While a relationship between northern Sweden and Tajikistan may seem far-fetched, several species are known to have colonized the Scandinavian Peninsula from both north (from Russia via Finland) and south (from Europe via Denmark) after the last glaciation [13]. As we increase K from three to eight, each new cluster splits a previously existing one along plausible geographical boundaries (and identifies some accessions as mixed). Thus K = 4 separates central European (Czech, Austrian, and Croatian) accessions from the main European cluster, K = 5 separates a subset of the United States accessions from the rest of Western Europe, K = 6 separates the Central Asian and Russian accessions from northern Sweden and Finland, K = 7 separates many German and southern Swedish accessions from the rest of Europe, and K = 8 identifies a Catalan cluster. Clusters identified for K > 8 did not contain the majority of any single individual genome. When these results are superimposed on a map (Figure 3), the pattern of isolation by distance becomes obvious. Individuals are, by and large, more similar to individuals that grow nearby than to individuals from far away. Although A. thaliana commonly occurs as a weed and human commensal, this has not been sufficient to erase population structure. Population structure is also evident in the distribution of pairwise differences between individuals. In the absence of population structure, all individuals should be equally closely related on average. As shown in Figure 4, this is clearly not the case. Not only is there generally a wider range of variation than would be expected in the absence of population structure, but there are also clear outliers. Some individuals are extremely closely related: these are typically from the same local population (and will be further discussed below). Perhaps more surprising is that two stock center accessions, Cvi-0 (Cape Verde Islands) (cf. [14]) and Mr-0 (Italy), are very different from all others (and each other). The distribution of pairwise differences can be conveniently summarized using hierarchical clustering. The population structure revealed by the resulting tree (Figure S1) generally agrees with the output of Structure. Variation within and between populations Because it is highly selfing, A. thaliana has often been considered a collection of asexual lineages, or “ecotypes.” This view is completely false. Indeed, even the first sequencing survey of variation showed clear evidence of recombination between accessions [15], and a subsequent study has shown that recombination has generally been sufficient to erode genome-wide LD on a very fine scale [16]. Consequently, there is no “phylogeny” of ecotypes. However, this still leaves open the possibility that local population structure is tree-like, with individuals within the same population being much more closely related to each other than to individuals from other populations. Indeed, since A. thaliana is a selfer, it is perfectly possible for a local population to consist of a single inbred sibship. We find that this is typically not the case; most sampled populations were polymorphic (even though this part of our study had relatively little power; see Materials and Methods), and when we consider the genome-wide data, the pattern that emerges is far from tree-like. As shown in Figure 2C and 2H, only a small fraction of all sequenced fragments have patterns of polymorphism compatible with monophyly in the sense that, with respect to that fragment, the members of a particular cluster are all more closely related to each other than to members of other clusters (the yellow “US” cluster is an exception to which we return below). Instead, just as is the case for human populations [17], most polymorphisms are shared between clusters, and levels of polymorphism are in general comparable within all clusters (see Figure 2E; for clarity, only values for K = 3 are shown). The same is true with respect to the local populations, although there is great variation between regions. An interesting way to describe the relationship between individuals is to try to identify chromosomal regions shared identical by descent by looking for long identical haplotypes [18–22]. The resulting patterns clearly reveal the difference in structure between geographic regions (Figure 5). In northern Sweden, pairs of accessions sampled from the same population are invariably more closely related to each other than to accessions from other populations; however, even here populations are far from monophyletic in the sense used above (i.e., for a particular locus, the closest relative may well come from a different population). In most regions (exemplified in Figure 5 by central Europe), haplotype sharing is moderate, and is not much greater within than between populations. The US sample is again different: here, pairs of accessions often appear to share entire chromosomes, and are equally likely to do so between populations. Regional variation in the level of population structure is also evident in the distribution of pairwise differences (see Figures 4 and S1). Individuals that are extremely closely related (less than ten differences) are almost always pairs from the same local population (two pairs from northern Sweden, one pair from Finland, and one from Germany). The one exception is a trio of nearly identical individuals from different US populations. Figure 2D and 2I illustrate the variation in FST across the genome for K = 3 and K = 8. This pattern is of interest in that regions with extremely high or low values may be seen as candidates for harboring selectively important loci [23,24]. Structure summarized The picture that emerges is that of a single, large, globally distributed population with historical gene flow sufficient to ensure that variation is shared worldwide, yet limited enough to cause considerable population structure. Genetic exchange is not only geographic; there has been enough outcrossing to ensure that LD decays within 25–50 kb on average (Figure 6), which is comparable to what has been observed in humans. All of this may seem surprising given the highly selfing nature of A. thaliana, but it is in fact completely compatible with theoretical predictions [16,25]. The only exception to this pattern comes from the US. Our sample from the US Midwest is clearly a heterogeneous collection, characterized by genome-wide LD and haplotype sharing. Especially notable is extensive haplotype sharing with accessions from other regions, in particular with the United Kingdom. All this strongly suggests that A. thaliana is a recent human introduction to the New World, and that the introduction severely reduced haplotype variation through bottleneck effects, causing genome-wide LD [16]. Since we have population samples from only the Midwest, we cannot rule out the possibility that the pattern is different in other parts of the US; however, we note that our one non-Midwestern US accession, Yo-0 (from Yosemite, California), is almost identical to some of our Midwestern accessions, and that a recent survey of variation in 53 US populations found no evidence of differentiation across the continent [26]. It should be emphasized that we view both Structure and hierarchical clustering as tools for exploring the data. The results should not be taken literally. For example, we do not believe that there are K = 8 random-mating populations in A. thaliana, as might be suggested by Figures 2 and 3, nor do we believe that populations are related in a tree-like manner, as might be suggested by Figure S1. Global Patterns Allele frequency distribution Note that θ̂S is consistently higher than θ̂P (see Figure 1). This is typically caused by an unusually high ratio of rare to common alleles, compared to what is expected under standard population genetics models [27]. A closer look at the distribution of allele frequencies reveals that this is indeed the case (Figure 7A). The observed distribution is skewed toward rare alleles compared to standard neutral expectations. A possible explanation for this is recent population growth [28]; however, the skew is much greater for nonsynonymous than for synonymous polymorphisms, suggesting that selective factors must also be involved. The effect of this genome-wide deviation from standard neutral models on “tests of selection” can be dramatic. These tests typically assume that the standard neutral model describes most of the genome, and interpret deviations at particular loci as signs of selection [3]. Our results show clearly that this procedure is not appropriate for A. thaliana (see also [29]). It is, of course, not a new finding that demographic history can invalidate tests of selection, but our results provide a striking illustration of the potential seriousness of the problem. For example, the mean value of one popular statistic, Tajima's D [27], is −0.8 rather than the (approximately) zero expected under simple neutral models, and the variance is also larger than predicted (Figure 7B). Positive values of Tajima's D are typically attributed to balancing selection: 2% of our fragments are significantly positive at the 1% level. Negative values are typically attributed to directional selection: 15% of our fragments are significantly negative at the 1% level. Although some of these deviations may, of course, actually be due to selection, our data suggest that tests based on standard cutoff values are anticonservative in both tails of the distribution. Consistent with this interpretation, a much higher fraction of studied genes have been reported to be under selection in A. thaliana than in other species [30]. Variation in the level of polymorphism While it is straightforward to fit a neutral model with growth to the observed distribution of Tajima's D (Figure 7B), the value of this exercise is doubtful. First, it is clear from Figure 7A that selection must be part of the reason for the skewed allele frequency distribution. Second, a model with growth would, in fact, fit other aspects of the data less well. In particular, population growth tends to reduce the variability in coalescence times across the genome compared to models with constant population size, resulting in less variation in the level of polymorphism between loci. We see the opposite: the variance between loci is considerably greater than expected under a standard neutral model with constant population size (Figure 7C). In addition, the distribution is heavily skewed and displays a long tail of extremely high values. There are several reasons to expect a poor fit to a simple neutral model. One is variation across the genome in the “neutral” mutation rate, θ, either due to variation in the level of selective constraint or due to variation in the actual, underlying mutation rate. Since the excess variability is observed equally for coding and noncoding DNA, we would have to invoke the latter. The extent to which variation in the mutation rate contributes to the pattern in Figure 7C can be estimated as soon as divergence data from a closely related outgroup species become available. Another factor likely to contribute to the variation in the level of polymorphism is population structure such as that described in the first part of the paper. It is well known that population structure can inflate the variance of coalescence times as well as induce a tail of very large values that would result in patterns of variability such as the one observed [6]. However, strong population structure is generally expected to push the distribution of Tajima's D toward positive values, which is the opposite of what we observe. It is possible that some model that involves growth (perhaps in conjunction with bottlenecks), structure, and finite sampling [7] could explain the pattern observed in A. thaliana, but we have not been able to find such a model (see also [29]). Genomic patterns of polymorphism It turns out that the pattern of polymorphism is affected by not only demographic forces, but also factors intrinsic to the genome. Figure 7D shows that polymorphism in noncoding regions is negatively correlated with local gene density. This mirrors the positive correlation between polymorphism and local recombination rates that was first noted in Drosophila [31] and has since been observed in a wide range of organisms [32]. A possible explanation is that recombination itself is mutagenic: this appears to explain the correlation observed in humans, but is not sufficient to explain the phenomenon in general [32,33]. Instead, it has been proposed that variation is reduced in low-recombination regions because of a “hitchhiking effect” due to selection on linked sites [34,35], in the form of either positive selection (“selective sweeps”) [36] or purifying selection (“background selection”) [37]. Such hitchhiking effects would be stronger in low-recombination regions because sites in these regions are affected by selection on larger pieces of the chromosome (i.e., more genes). Recombination is thus used as a proxy for gene density: the real prediction of these models is that polymorphism should decrease with gene density. This is precisely what we observe. The level of polymorphism is insignificantly positively correlated with recombination (suggesting that although recombination may well be mutagenic, this does not explain the phenomenon), but is strongly negatively correlated with gene density. Two factors suggest that background selection rather than selective sweeps is responsible for the correlation. First, unlike background selection, selective sweeps are expected to skew Tajima's D toward negative values. However, we find no correlation between Tajima's D and gene density. Second, it is clear from Figure 7A that a significant load of deleterious mutations (as is required by background selection) exists in A. thaliana. Figure 7E reveals that polymorphism is also positively correlated with segmental duplication, similar to what has been observed in humans [38–40]. In humans, the phenomenon appears to be largely due to misclassification of paralogous sequence variants as SNPs. Distinguishing between paralogous sequence variants and true SNPs is difficult for human data, where putative SNPs are typically detected in small samples of highly heterozygous individuals. In contrast, our data consist of high-quality sequences from a large sample of almost completely homozygous individuals, and we are therefore confident that nearly all of our polymorphisms are genuine (see Materials and Methods); fragments that have a close match elsewhere in the genome thus appear to be more variable than fragments that do not. We hypothesize that this is caused by a low level of intergenic gene conversion that serves to “shuffle” variation between loci. Such gene conversion has long been known to occur in large multigene families (“concerted evolution”; [41]): our results suggest that it may be a general phenomenon. Recombination and gene conversion We noted above (see Figure 6) that LD decays within 25–50 kb, somewhat faster than has previously been suggested [16]. At least 25% of our sequenced fragments show evidence of recombination (using the four-gamete test; [42]). Estimates based on coalescent models suggest an effective population recombination rate (e.g., [6]) of approximately ρ = 2 × 10−4 per basepair (V. P., B. P., P. Marjoram, J. W., and M. N., unpublished data). Given our estimates of the mutation rate θ (see Figure 1), this implies a ratio θ/ρ of about 20. The short-range pattern of LD in several species is incompatible with the long-range pattern; there is too little of the former relative to the latter for a simple recombination model that includes only crossing over to explain the data [43–48]. Possible explanations include gene conversion and multiple mutations (i.e., each SNP not being due to a unique mutation event), both of which will erode short-range LD [46]. There is clear evidence for both phenomena in our data. We observed a total of 315 tri-allelic SNPs. Since less than 50% of all multiple-hit mutations will result in more than two distinct alleles, this suggests that a total of more than 600 of our SNPs are, in fact, the product of multiple mutations. We also observed three fragments that show clear evidence for gene conversion in that a single gene conversion event (i.e., a double cross-over within 500 bp, as would result from the resolution of a single Holiday junction) suffices to explain a complicated pattern of polymorphisms based on multiple SNPs in the fragment. Coalescent-based analyses based on the fine-scale pattern of polymorphism suggest that gene conversion is about five times more common than crossing over, in agreement with previous population genetic analyses [48], as well as with direct estimates based on tetrad analysis [49]. Concluding Remarks We have shown that the pattern of polymorphism in A. thaliana, a selfing human commensal, generally agrees with what would be expected for a widely distributed sexually reproducing species. Although there is significant population structure, polymorphism is shared worldwide. As predicted by population genetics theory [25], the only clear indications of selfing in the pattern of polymorphism are that individuals are typically homozygous, and that LD is unusually extensive. The scale of our study allows us to consider the genomic distribution of statistics commonly used to summarize polymorphism data. We find that these distributions generally deviate significantly from what is assumed by standard population genetics models. This highlights the danger of using highly parameterized models based on untested assumptions for inference in population genetics. Commonly used “tests of selection” are simply not valid in A. thaliana (cf. [29,30]). Large-scale analyses in other organisms have similarly found genome-wide deviations from standard models (e.g., [43,50,51]). As data continue to accumulate, the focus of population geneticists will surely have to shift from rejecting null models that do not fit particular loci to finding models that actually do explain the bulk of the data. Genomic polymorphism data are required to develop more robust inference methods, and will enable us to study phenomena that are intrinsically genomic (e.g., the correlations in Figure 7D and 7E). More importantly, however, these data will help identify the functional polymorphisms that underlie phenotypic variation. The pattern of polymorphism in A. thaliana, characterized by humanlike levels of LD but much higher SNP density, coupled with the availability of naturally occurring inbred lines, makes the species ideal for LD mapping. Although the strong population structure is likely to cause a high rate of spurious genotype–phenotype associations, these problems can easily be overcome through direct experimental verification using crosses or transgenics. This significantly strengthens the position of A. thaliana as a model for evolutionary functional genomics. Materials and Methods Sampling The sample of 96 individuals included pairs of individuals from 25 local “populations” (typically sampled within a few hundred meters of each other, often much closer) as well as a worldwide survey of commonly used stock center accessions (Tables S1 and S2). Where possible, four populations were sampled from each of several regions. The sample was generated by screening a larger set of accessions with a small number of markers to avoid inbred siblings or extensively heterozygous individuals (E. B., E. Stahl, C. T., M. N., M. K., and J. B., unpublished data). Accessions were genotyped using 11 unlinked markers (five microsatellites, two indel R-genes, and four housekeeping genes with previously identified polymorphisms). To ensure that individuals sampled from local populations were not part of inbred sibships, four (three in one case) individuals from each of 37 populations were tested. Polymorphism was found in 25 of these populations, and a pair of nonidentical individuals was selected at random from each (Table S1). Some accessions not from the same population were also found to be identical with respect to these markers (Col-0 and Lp2-2; Ts-1 and Shahdara), but these were included nonetheless. Five accessions were found to be heterozygous and were eliminated. Four of these were from the population samples, and one, Ms-0, was from the stock center. Further testing of two additional Ms-0 lines revealed one more heterozygote and one homozygote, which was included. In spite of these precautions, one sequenced stock center accession, Van-0, turned out to be extensively heterozygous and was eliminated from the analyses in this paper (bringing the sample size to 95). Data generation We used direct, PCR-based sequencing of genomic DNA, with primers designed from the A. thaliana reference sequence to cover the genome relatively uniformly. To achieve uniform density of our fragments, the reference genome (releases January 7, 2002, and April 17, 2003) was first divided into equally spaced regions. The last 10 kb of each region then served as an input record to Primer3 (v. 0.6). The designed primer pairs returned from Primer3 for each region were then screened for uniqueness and quality. To screen for uniqueness, all primer pairs were BLASTed (BLAST v. 2.2.3) against both the reference genome as well as BAC datasets downloaded from the Arabidopsis Information Resource (http://www.arabidopsis.org/). Any primer pair that produced a hit in the same region (≤2,300 bps) was removed. Self-amplifying primers were also removed based on this same criterion. Additionally, primers with more than five BLAST hits against the reference were also discarded. To improve the quality of each fragment, any primer pair that amplified a target sequence that contained a homonucleotide run of nine bases or more was removed. All sequencing was done using ABI 3700 automated sequencers (Applied Biosystems, Foster City, California, United States). All fragments were sequenced in both directions. Chromatograms were initially base-called with Phred (v. 0.020425.c) and trimmed based on quality value. The start and end of each read was trimmed until the average quality value was 25 in a window of ten bases, and internal bases were converted to missing data when their quality value was below ten. Accessions missing one read of data were trimmed more severely (different setting were used). A combination of Phrap (v. 0.020425.c) and ClustalW (v. 1.82) was used for producing alignments using a modified weight matrix that allowed us to incorporate quality values into the ClustalW algorithm. Alignments were then visually inspected and adjusted as necessary using Consed (v. 13.0). Polyphred (v. 4.20) was used to flag potential heterozygotes, which were confirmed by visual inspection of chromatograms. Additional trimming was performed as necessary for accessions with multiple false polymorphisms and low-quality sequence after a visual inspection of chromatograms. Whenever two reads from the same accession disagreed, the final call was made by visually inspecting chromatograms unless the difference in quality value made the final call obvious. Potential polymorphisms in each alignment were then verified by a second person. All alleles found only once or twice in the sample were verified by visually inspecting the chromatograms. When this inspection did not reveal a chromatogram peak clearly different from the other accessions, the base was changed to missing data. This would, if anything, produce a slight underestimate in alleles of frequency one and two. Higher-frequency polymorphisms with generally low-quality values (20 or lower) were also verified by checking the chromatograms. A total of 876 high-quality fragment alignments were obtained from 979 PCR primers and used for the analyses in this paper. Of the remaining PCR primers, some failed at the stage of PCR amplification and sequencing, while some produced sequencing output that could not be base-called with certainty when the sequence quality was particularly low or when there was evidence that the primer pairs amplified two or more different products in some of the accessions. To calculate genetic distances, we used a set of markers that have been genetically mapped to the Lister and Dean recombinant inbred lines and that also can be mapped to the AGI reference genome. Some markers were removed so that both physical position and genetic position were monotonically increasing functions. All data are publicly available through our Web site (http://walnut.usc.edu/2010, and also as Dataset S1. Population structure To infer population structure and assign accessions to populations, we used a model-based clustering algorithm implemented in Structure v. 2.0 [12]. Since A. thaliana is largely homozygous, we used a haploid setting. We used the “linkage model” with “correlated allele frequencies” in Structure, where genetic distances (calculated by fitting a third-order polynomial to the Lister and Dean recombinant inbred mapping data) were used to indicate locus proximity. The algorithm was run with a burn-in length of 50,000 MCMC iterations and then 20,000 iterations for estimating the parameters. This was repeated ten times for each K (ranging from one to 17). In these analyses, each fragment-haplotype was treated as a marker at a multiallelic locus, so that two accessions had a different type if they differed at any site in the fragment. The likelihood of the data increases with K from K = 1 until K = 7 (using the Wilcoxon two-sample test to compare the ten runs for each K; two-sided p = 0.001 for K = 7 versus K = 6). The likelihoods of K = 7 and K = 8 were similar (two-sided p = 0.97). For K > 7, the likelihoods of different runs were more variable than for K ≤ 7, with the added variability caused only by runs with lower likelihoods. Moreover, the additional clusters for K > 8 do not have a majority of the genome for any of the accessions. These observations taken together indicate that it is less meaningful to choose K > 8. In displaying the output from Structure, we computed an average of the ten runs for each K. Because there are K! distinct permutations of the clusters that all correspond to equivalent assignments of membership coefficients to accessions, and because independent runs may produce different permutations, to compute an average we first permuted the clusters to align the solutions. For R runs, there are (K!) R − 1 ways of aligning clusters across runs. To determine which of the clusters of each of the other runs corresponds to a specific cluster in a given run, the symmetric similarity coefficient (SSC) was used with the matrices of membership coefficients (based on the genome-wide average). For a given K, the SSC was calculated for all combinations of pairs of runs: where Q i and Q j are the membership matrices of runs i and j (i ≠ j), P is a permutation; the minimum is taken over all permutations, S is a probability matrix of K columns where all elements equal 1/K, and A F is the Frobenius matrix norm [52]. This is a slight adaptation of the asymmetric similarity coefficient used in previous work [17]. For K = 2, the runs were permuted to the arrangement that maximizes the sum of SSC across pairs of runs, and an average of the membership matrices across runs was then taken. For K > 2, it was not feasible to test all possible arrangements; therefore, the following greedy algorithm was used. (1) Fix a permutation, P 1, of one (randomly chosen) run, . (2) Randomly choose a second run, Q2, and fix the permutation, P 2, that maximizes . (3) Continue sequentially with each remaining run, Q x , where x = 3,…, R, and fix the permutation, Px , that maximizes for the current run, Q x . Because the choice of starting run can affect the result, we tested all ten possibilities for the starting run. For K = 2 to K = 8, there were thus 70 possible ways of starting the algorithm, and in only two of 70 possible cases was a different result obtained. These two solutions differed from the common solution by switching one pair of clusters in one run (2.5% of the clusters differed from the common solution), and switching one pair of clusters in two different runs (5%). We tested for monophyly as follows. For every variable site in a fragment, each cluster was checked for the presence of both alleles as well as for the presence of both alleles outside the cluster. If a variable site in a fragment had both alleles within the cluster as well as outside the cluster, then the whole fragment was deemed nonmonophyletic for that specific cluster. Clusters that failed to show nonmonophyly for a fragment were considered monophyletic for that fragment. Fragments with less than five variable sites and clusters with less than five accessions were always considered to be nonmonophyletic. FST for the inferred clusters was computed as: where θ̂P, total is the average number of pairwise differences per site for all pairs of accessions, and θ̂P, within is the average number of pairwise differences per site for all pairs within cluster i. Of the total of 95 accessions, 40 were hierarchically sampled in pairs from four populations in each of five regions (Table S1). The total amount of variation among these 40 accessions, θ̂P, among40 , was computed by taking the total average pairwise difference for all pairs of the 40 accessions, whereas the amount of variation within populations, θ̂P, withinpop , was calculated by taking the mean of the total average pairwise difference for the pairs of accessions in the 20 populations. The level of variation among geographical regions, θ̂P, amongreg , was computed as the difference between θ̂P, among40 and the mean of the total average pairwise differences for all pairs of accessions within regions. The level of variation among populations, θ̂P, amongpop , was calculated from the following expression: Genomic patterns of polymorphism Correlations were identified between levels of polymorphism and local gene density or degree of duplication (Figure 7). The local gene density was measured as open reading frames per centimorgan in windows of size greater than or equal to 1 Mb (using genetically mapped markers from the Lister and Dean recombinant inbred data as endpoints). The number of open reading frames (excluding pseudogenes and RNA genes) from the annotated reference sequence that fell between these window endpoints was counted, and length in centimorgans of each window was estimated from the genetic distance of the markers used as window endpoints. Correlations were quantified using Spearman's rank correlation, and the significance of the observed values was evaluated using 50,000 permutations that maintained the chromosomal order of all observations but that shuffled the relative positions of the two variables. (For each variable, the lists representing the consecutive values within each chromosome were concatenated in random order and direction to form a circle. The two circles were then randomly aligned with each other.) This is necessary to avoid inflated significance values due to autocorrelations along the chromosomes (of both variables). Using this procedure, the rank correlation between θ̂S in nonexon sequences and gene density is −0.27 (p = 0.0014), and the rank correlation between θ̂S in nonexon sequences and the negative log of the second-best BLAST e-value is 0.13 (p = 0.0018). To investigate the effect of population structure, all analyses (except those of population structure) were repeated with the outliers in Figure 4 removed (Cvi-0, Mr-0, and all but one randomly chosen member of each closely related group). All conclusions remain qualitatively the same. Supporting Information Dataset S1 All Data Used in the Paper (912 KB ZIP). Click here for additional data file. Figure S1 Hierarchical Clustering of Individuals Based on Pairwise Differences (12 MB EPS). Click here for additional data file. Table S1 The Population Samples Used in the Project (8 KB PDF). Click here for additional data file. Table S2 The Individual Accessions Used in the Project (8 KB PDF). Click here for additional data file.
            Bookmark
            • Record: found
            • Abstract: found
            • Article: not found

            2b-RAD: a simple and flexible method for genome-wide genotyping.

            We describe 2b-RAD, a streamlined restriction site-associated DNA (RAD) genotyping method based on sequencing the uniform fragments produced by type IIB restriction endonucleases. Well-studied accessions of Arabidopsis thaliana were genotyped to validate the method's accuracy and to demonstrate fine-tuning of marker density as needed. The simplicity of the 2b-RAD protocol makes it particularly suitable for high-throughput genotyping as required for linkage mapping and profiling genetic variation in natural populations.
              Bookmark
              • Record: found
              • Abstract: found
              • Article: not found

              Genome-Wide Patterns of Nucleotide Polymorphism in Domesticated Rice

              Introduction Domestication is a complex, cumulative evolutionary process in which human use of organisms leads to morphological and/or behavioral changes distinguishing domesticated species from their wild ancestors [1,2]. Beginning with Charles Darwin [3,4], there has been strong interest in the study of domestication of crop species as a means of understanding the nature of selection. Moreover, domestication and the development of agriculture are arguably the most important technological innovations in human history [5]. Crop plant domestication was the linchpin of the Neolithic Revolution 10,000–12,000 years ago, in which hunter-gatherer groups transitioned into sedentary agricultural societies that gave rise to current human cultures [6]. With domestication came the availability of food surpluses, and this agricultural development led to craft specializations, art, religious and social hierarchies, writing, urbanization, and the origin of the state [5]. One of the earliest domesticated crop species is cultivated Asian rice, Oryza sativa L., which has become the world's most widely grown crop and has also assumed the stature of a key model system in plant biology. Rice consumption constitutes about 20% of the world's caloric intake, and in Asian countries, where over half of the world's population lives, rice often represents over 50% of the calories consumed [7]. Because of its small genome size, rice has been the first crop plant to have its whole genome sequenced [8–10]. A wealth of morphological, physiological, and ecological variation exists within cultivated Asian rice, reflected in the large number of recognized cultivars or strains [11,12]. Two main rice varietal groups, O. sativa indica and O. sativa japonica, have been recognized since ancient China [13]. Although phenotypic distinctions between these groups is not always straightforward, indica varieties tend to be found throughout the tropical regions of Asia and are primarily grown in lowland conditions, while japonica types are differentiated into tropical japonica, distributed in upland tropical regions, and temperate japonica, a recently derived group cultivated in temperate regions [11,13,14]. Additional variety groups include aus, drought-tolerant rice from Bangladesh and West Bengal, and aromatic, fragrant rice from the Himalayan range [14,15]. All rice varieties have a predominantly self-fertilizing mating system [13]. Both morphological and isozyme data have established that O. rufipogon Griff., a partially outcrossing species native to southern Asia, is the wild ancestor of domesticated rice [13]. In this paper, we describe the levels and patterns of DNA sequence polymorphism across the rice genome and that of its wild ancestor, O. rufipogon. To our knowledge this is the first genome-wide characterization of sequence variation in domesticated Asian rice, and we show that rice contains a unique pattern of excess high-frequency derived single nucleotide polymorphisms (SNPs) that has not been reported in other species. We develop four models to explain patterns of genetic variation in O. sativa and O. rufipogon, including a simple selectively neutral bottleneck model that has been previously thought to be the dominant demographic force shaping levels of nucleotide variation in crop species. We demonstrate that this simple bottleneck model is inadequate to explain the origin of domesticated rice. We conclude that either positive selection has made a significant impact on genomic polymorphism patterns, or that domestication involved an extremely severe bottleneck (∼99.5% reduction) coupled with gene flow among modern varieties and between domesticated rice and its wild ancestor. Results/Discussion Nucleotide Variation in the Rice Genome To assess levels and patterns of polymorphism in the rice genome, we sequenced one hundred eleven randomly chosen gene fragments (sequence-tagged sites or STS) in a diverse panel of Oryza accessions, including 72 from O. sativa and 21 from O. rufipogon (Tables S1 and S2). Average silent (synonymous and noncoding) site nucleotide diversity (θπ) across all sampled loci in O. sativa is approximately 3.20 × 10−3 (Table 1). Levels of polymorphism in the wild ancestral species, O. rufipogon, are predictably higher than rice, with a mean silent θπ of 5.19 × 10−3 (Table 1). These levels of polymorphism are lower than those observed for maize, a domesticated outcrossing species [16], and Arabidopsis thaliana, a selfing, wild species [17,18]. Table 1 Average Diversity Measures in Oryza spp. across 111 STS Regions To determine if any genetic differentiation due to population structure among rice groups is evident in these STS sequences, we used the Bayesian clustering program STRUCTURE [19]. The highest likelihood obtained was with a model specifying K = 7 groups (Figure 1; Table S1). Five groups occur within O. sativa and correspond to the traditional variety designations, as described previously [14]. Evidence of some limited geographical population structure is also observed in O. rufipogon (Figure 1; Table S1). Neighbor-joining analysis of the concatenated STS sequences (Figure S1) revealed two distinct clusters within cultivated rice; one comprises a tropical japonica, temperate japonica, and aromatic rice lineage, and another consists of aus and indica rice. The apparent monophyly of these major groups is consistent with at least two domestication events in rice [14,20–24]. The nesting of the aromatic and the temperate japonica variety groups within tropical japonica suggests the first two groups originated from secondary divergence events from the latter, although the lack of support for tropical japonica branches does not exclude other possible divergence scenarios (Figure S1). Indica and aus relationships, on the other hand, are consistent with rapid divergence after domestication or separate domestication events from the same ancestral gene pool. Within-group SNP levels of cultivated rice are lower than those of the whole species (Table 1), with subpopulations harboring between 19% (temperate japonica) and 43% (indica) of the polymorphism of O. rufipogon. Assuming separate domestication events, the japonica clade contains 42% and the indica clade contains 48% of the diversity levels found in O. rufipogon. Figure 1 Estimated Population Structure for 97 Accessions of O. sativa and O. rufipogon from 111 STS Loci Vertical bars along the horizontal axis represent each Oryza accession; for all accessions, the proportion of ancestry under K = 7 clusters that can be attributed to each cluster is given by the length of each colored segment in a bar. The Derived Site-Frequency Spectrum is U-Shaped in O. sativa Because of the strong population structure evident in our rice sample, it is necessary to assess patterns of variation separately for each group when making inferences about the evolutionary dynamics of domestication. Indica and tropical japonica represent the most widely grown cultivars for each of the separate domestication events, and we limited our characterization of polymorphism patterns to these two groups. We examined the frequency spectrum of segregating sites within loci using Tajima's D [25], and found that O. rufipogon and the two main rice subspecies show an excess of rare alleles, as evidenced by the biased distribution of Tajima's D toward negative values (Figure S2; Table 1). Crops are expected to have gone through a population bottleneck during domestication, as only a limited number of founding individuals were brought into cultivation. The distribution of Tajima's D in the domesticated rice varieties is inconsistent with a recent bottleneck, however, as these should reduce levels of low-frequency variants and bias measures of Tajima's D toward positive values. It is possible that subsequent population expansion, due to the spread of rice agriculture, could be responsible for the over-representation of rare alleles segregating in domesticated rice varieties, or selection may have played a role. We further examined the derived site-frequency spectrum across SNPs (i.e., the fraction of derived polymorphisms present at various frequencies within a group) in indica and tropical japonica. To infer ancestral alleles for each SNP, we used as an outgroup O. meridionalis, a species believed to have diverged from O. sativa ∼2 million years ago [21]. In each O. sativa variety we observed a large number of high-frequency derived mutations (i.e., derived SNPs above 70% frequency in the population) leading to a U-shaped frequency distribution (Figure 2); this type of pattern has not been reported at the genomic level in any other species. Figure 2 The Observed Marginal Derived Site-Frequency Spectra of Noncoding and Synonymous SNPs for Two Population Pairs: indica and O. rufipogon and tropical japonica and O. rufipogon To accommodate SNPs with missing data, all spectra are plotted as the expected site frequency spectrum in a subsample of the data of size n = 16. Possible explanations for the excess of high-frequency derived SNPs in O. sativa include the misidentification of ancestral states due to shared polymorphism with O. meridionalis, or the occurrence of multiple mutations at given sites since divergence from O. meridionalis. However, both misidentification of derived alleles and multiple hits would be expected to also affect the site-frequency spectrum of O. rufipogon, which is not observed (Figure 2). This suggests that the O. sativa derived site-frequency distribution is a result of the domestication process. Furthermore, derived alleles at high frequency in the O. sativa varieties occur primarily at low to intermediate frequency in O. rufipogon, suggesting that such alleles have only recently increased in frequency (Figure S3). We also checked the ancestral state calls in O. sativa using the African wild rice O. barthii. Although O. barthii is more closely related to O. sativa than is O. meridionalis, if we assume that both wild species share ancestral polymorphisms with domesticated rice, the possibility that we always identified the same alternative allele as derived in our sample should be low. Using this approach, we find that 88% of our ancestral SNP calls in indica and 86% in tropical japonica matched in O. barthii and O. meridionalis. Even when using only the matched calls (which is a very conservative criterion, since it does not take into account drift and/or fixation processes in O. barthii), the site frequency spectrum in O. sativa varieties remains U-shaped. An excess of high-frequency derived SNPs is often interpreted as a result of genetic hitchhiking during recent selective sweeps [26]. Because the site-frequency spectrum in rice varieties is observed from randomly selected loci, and the loci contributing high-frequency derived SNPs are distributed across the genome (Figure S4), this pattern suggests that strong linkage to positively selected mutations occurred within most of the genome. However, demographic forces may have also played a role in shaping the rice genomes. We developed several demographic models and a multiple selective sweeps model to test which evolutionary processes may best explain the observed patterns of polymorphism in rice. Demographic Models for Rice Domestication: A Neutral Population Bottleneck Model The most widely accepted demographic model for crop domestication is a neutral bottleneck model [27–29]. In this model, rice domestication is assumed to be a result of recent population divergence, with one of the two daughter populations experiencing a reduction in population size at divergence associated with the founder effect at the time of domestication, followed by population growth as cultivation of the crop increases. To fit this model to our data, we used a diffusion-based approach [30–32] to predict the pattern of allele frequencies in domestic and ancestral populations under selective neutrality. Details of the inference procedure can be found in the Materials and Methods section. The composite-likelihood function we employed uses the reduction in diversity observed in either of the domesticated rice subspecies and the shift in allele frequency distribution to estimate four parameters: the time back until the start of domestication (τ1), duration of the bottleneck (τ2), ratio of current population to ancestral population size (ν2), and relative size of the bottleneck population to the ancestral population (νb). The duration of the bottleneck was assumed to be 25% of the time back until domestication (τ2 = 0.25 × τ1), which is consistent with archeological data suggesting it took ∼3,000 y from the time of initial cultivation (∼12,000 y ago) until the appearance of domesticated rice grains [33,34]. Bottleneck parameter estimates for indica and tropical japonica are broadly comparable, with a slightly more severe bottleneck in tropical japonica (Table 2). Assuming the time back to the beginning of domestication for both variety groups was ∼12,000 y [35], we can independently derive estimates of the current O. rufipogon effective population size, N rufi, using the relationship τ1 × 2N rufi = 12,000 (because τ1 is scaled by 2N rufi). From the indica analyses, N rufi is equal to 12,000/(2 × 0.1044) = 57,471, and from the tropical japonica analyses is equal to 12,000/(2 × 0.0508) = 118,110 (this exact value of N rufi is important in scaling all of the estimated parameters into years and number of individuals). The indica-derived N rufi estimate implies bottleneck and current estimated population size (N e) for indica of (νb × Nr ufi) = 1,413 and (ν2 × N rufi) = 40,229 respectively. The second estimate suggests a bottleneck and current N e sizes for tropical japonica of (νb × N rufi) = 1,334 and (ν2 × N rufi) = 46,889, respectively. Table 2 Maximum Likelihood Estimates for Demographic Parameters of the Bottleneck and Bottleneck plus Migration Models in indica, tropical japonica, and O. rufipogon The differences in estimates of N rufi from each analysis could be attributable to differences in the founding population of each variety group or differences in the timing of each domestication event. We note, however, that a bottleneck model conditioned on coincident domestication for indica and tropical japonica (equal τ1 values) differs only by 1.8 log likelihood units (unpublished data), suggesting that equal timing of domestication is likely to have occurred. An independent estimate of N rufi can be found by using the estimated scaled population silent mutation rates (θW = 4N rufi μ = 5.42 × 10−3 per bp; Table 1) and the observation that the O. rufipogon site-frequency spectrum is consistent with that of a population of long-term constant size (Figure 2). Assuming a neutral mutation rate of 10−8 per bp, yields a point estimate of N rufi = 135,500, which is slightly higher, but close to the estimates found by conditioning on the start of domestication. Demographic Models for Rice Domestication: A Complex Model Incorporating Subdivision, Bottlenecks, and Migration It is important to note that population bottlenecks alone would not generate the strong excess of high-frequency derived alleles and strong U-shaped site-frequency spectrum observed in O. sativa (Figure 2) [36]. In order to explain this aspect of the data, we considered several demographic models that included ancient subdivision in the ancestor of rice, a bottleneck at the time of domestication for each domesticated varietal group, and limited gene flow between the independently domesticated rice groups indica and tropical japonica. Ancient, strong subdivision is not evident in our O. rufipogon sample (Figure 1); F st between Chinese and non-Chinese O. rufipogon is low, about 0.16, and no interior modes are evident in the site-frequency spectrum of O. rufipogon, as expected under subdivision. However, it is possible for limited gene flow in O. rufipogon to lead to some differentiation of allele frequency between groups, but not so much that it would have a strong effect on a combined O. rufipogon sample. Furthermore, the population bottlenecks induced by independent domestication events could amplify any allele frequency differentiation between indica and tropical japonica, and limited gene flow between these two groups could introduce ancestral alleles into each population, causing mutations previously fixed in one group to be observed as high-frequency derived alleles in the other. To test the effect of ancestral population substructure within O. rufipogon prior to the domestication of the two O. sativa groups, we fit the parameters of a complex demographic model to our data using a composite likelihood technique (see Materials and Methods). We began by exploring a model with seven demographic parameters, which consists of O. rufipogon being subdivided into two demes of equal size, sharing on average M R migrants per generation. Current-day indica varieties are descended from one of these demes, while tropical japonica varieties descend from the other. During the domestication process, each population underwent a bottleneck that began τ1 generations ago (in units 2N rufi) and had severity νb (the ratio of the reduced population size to the ancestral size). After τ2 = 0.25 × τ1 generations (∼3,000 y), both indica and tropical japonica partially recovered, instantaneously reaching a fraction υI and υJ of the ancestral size, respectively. Contemporary gene flow (since domestication) between tropical japonica, O. rufipogon, and indica is captured by the last parameter, the average number of migrants per generation between these demes (M S). This model was conceived because it incorporates key demographic features of rice or crop domestication (e.g., bottlenecks, two domestication events) and could conceptually generate the observed derived SNP site frequency spectrum. In preliminary analyses, we found that the migration rate (M R) between the two ancestral O. rufipogon demes was very large, with the marginal likelihood surface for this parameter near its maximum value whenever M R > 7. This is consistent with our observations of limited population structure in O. rufipogon (above), and we therefore discarded ancestral population structure as a main contributor to the patterns observed in our dataset, and simplified the demographic model to consider only a single ancestral population from with both indica and tropical japonica derive (with migration rates among the three remaining demes, M S = 4N rufi m). This assumption reduced the computational complexity, so that the remaining parameters could be estimated via a grid search using an initial size of over 2,000 points with 1,000,000 coalescent simulations per point. The resulting model (which we refer to as the bottleneck plus migration model) has five free parameters with composite maximum likelihood estimates of M S = 7.0 (migration between demes), νb = 0.0055 (domestication bottleneck size), νI = 0.27 (ratio of indica to O. rufipogon N e), νJ = 0.12 (ratio of tropical japonica to O. rufipogon N e), and τ1 = 0.04 (start of domestication in units of 2N rufi) (Table 2). It is important to note that coalescent simulations scale the migration based on population size, so the number of migrants entering into the tropical japonica population is smaller (0.5 × M × νJ = 0.42), than into indica (0.5 × M × νI = 0.945), and O. rufipogon (0.5 × M = 3.5). In Figure 3, we report the profile composite-likelihood contours for the three key demographic parameters in the bottleneck plus migration model: migration rate, start of the bottleneck, and severity. The figure is constructed by holding two parameters fixed at a given point in the (x,y) plane, optimizing over the third parameter, and reporting the maximum likelihood attained for the (x,y) point (due to computational limitations the figure was constructed holding the ratio of current-day indica and tropical japonica populations at their maximum composite-likelihood estimates). We note that the three parameters are moderately to strongly correlated, but only a restricted set of values in high dimensional space is consistent with the data. These solutions all include: a very strong bottleneck (>99% reduction), high rates of migration within and between domesticated and wild populations of Asian rice (M > 5), and current-day effective population sizes for cultivated rice that are substantially smaller than those seen in the ancestral population. We also note that the model solutions show a positive correlation between size of bottleneck population and timing of the bottleneck, a negative correlation between size of the bottleneck and migration, and a negative correlation between migration and timing (consistent with the ∼2-fold difference in the estimated time of the bottleneck between the model with migration and the model without). Figure 3 Contours of Composite Profile Log-Likelihood Surface under the Bottleneck and Migration (i.e., “Complex Demography”) Model for Three Key Demographic Parameters Parameters include bottleneck severity, migration rate among demes (4Nm), and τ1 (time back until start of domestication scaled in units of 2N rufi). The maximum composite-likelihood estimate of the parameters is denoted by a red filled circle. As can be seen in Figure 4, the expected site-frequency spectrum under the best fitting bottleneck plus migration model matches the observed frequency distributions fairly well for both O. rufipogon as well as indica, but not as well for tropical japonica. As expected, the total number of SNPs in each of the three populations is predicted quite well by the model. We quantified the fit of the model to the observed data using a modified Pearson Chi-square goodness-of-fit (GOF) statistic, and found that the best-fitting complex demographic model is an excellent fit to the marginal indica (GOFI = 20.26, p = 0.72) and O. rufipogon site-frequency spectra (GOFR = 7.57; p = 0.99), and an adequate fit to the tropical japonica site-frequency spectrum (GOFT = 37.83, p = 0.22). One interesting observation is that the demographic model underpredicts the excess of high-frequency derived alleles observed in tropical japonica—a potential indication of recent positive selection. Given that artificial selection was probably quite strong and frequent during and after domestication, we further explored models that incorporate selection during the domestication process of O. sativa. Figure 4 Observed and Expected Derived Site-Frequency Spectra under Various Models The observed derived site-frequency spectrum for (A) indica and (B) tropical japonica, along with the expected site-frequency spectrum under the simple bottleneck, bottleneck plus migration demography, and bottleneck plus sweeps models. (C) Observed site-frequency spectrum for O. rufipogon and expected frequencies using a standard neutral model and a bottleneck plus migration model. Selection Models for Rice Domestication Since strong selection is known to accompany crop domestication, we developed two alternative models incorporating multiple selective sweeps to explain the unusual polymorphism patterns in indica and tropical japonica. In a neutral locus linked to a single, recent selective sweep, let f i be the probability of observing a neutral mutation segregating at frequency i in a sample of size n, conditional on the locus being variable. An expression for f i has been derived [26] and further extended [37,38], and includes the genomic distance d (measured in bp) between neutral and selected loci, a compound parameter α, which represents the combined contributions of recombination, selection, and population size, and the “background” allele frequency distribution (i.e., the expected site-frequency spectrum for loci unlinked to a selected site). These results for a single sweep can be used to predict the site-frequency spectrum at randomly chosen loci if multiple sweeps have recently occurred. Assuming that selective sweeps occur at random positions in the genome at a density of κ sweeps per bp, the distance between a random neutral locus and the nearest sweep will be approximately exponentially distributed with mean 1/(2κ). Define the function φ i (d, α, κ) to be the probability of observing i copies of a neutral mutation in a sample of n chromosomes, given that a sweep occurred at a distance d bp away with compound parameter α [38], and background site-frequency spectrum q. By integrating over the distance between the sampled locus and the unknown target of the sweep, the marginal probability, P i, of observing a randomly chosen SNP at frequency i in a sample of n chromosomes is a function of κ, α, and q [38]: This probability can be used to calculate the composite likelihood of the data and estimate the parameters κ and α (see Materials and Methods). It should be noted that this equation assumes that the neutral locus is affected only by the nearest selective sweep. We considered two distinct models. The first is a model in which strong selection is the only force that has acted in domesticated rice populations, and uses the normalized O. rufipogon site-frequency spectrum as the background frequency distribution. The second, a bottleneck plus sweeps model, allows multiple selective sweeps to affect patterns of variation immediately following a population size change. The background site-frequency spectrum in the latter case can be approximated using the predictions of a simplified neutral bottleneck model. The bottleneck plus sweeps model incorporates the sweep density κ, the compound parameter α (the combined contributions of recombination, selection, and population size), and a bottleneck severity parameter ν. The likelihood surfaces for both the pure selection and the bottleneck plus sweeps model in rice each contains a long ridge where different parameter combinations have almost equally high likelihoods, implying that a model with high sweep density and relatively weak selection is just as likely as a model with low sweep density and strong selection (Figure 5). For both models, the ridge of maximum likelihood is shifted to the right in tropical japonica, indicating that for a given value of the selection severity parameter α, the sweep density in tropical japonica is estimated to be twice that in indica. Figure 5 Composite Likelihood Surfaces in indica and tropical japonica under Models Incorporating Selection A density plot of the marginal composite log-likelihood surface of the parameters α and κ, with the bottleneck severity υ fixed to its estimate, under the bottleneck plus sweeps model for (A) indica and (B) tropical japonica. The composite log-likelihood surface of the parameters α and κ under the pure selection model for (C) indica and (D) tropical japonica. The composite log-likelihood is represented as a deviation from the maximum log-likelihood, with lighter values representing higher composite likelihoods. Numbers above (A) and (B) indicate the total number of sweeps in the rice genome corresponding to each value of κ, and numbers to the right of (B) and (D) represent the selection coefficient, s, corresponding to each value of α, substituting an effective recombination rate of r = 10−12 and ln(2N) = 10 into the expression: α ≈ rs −1 ln(2N), then solving for s. Sweep density is confounded with selection strength due to the effect of a mating system change on recombination rate. In domesticated rice, the transition to selfing likely occurred simultaneously with the sweeps, making it difficult to disentangle the recombination rate and selfing parameters. Under a recent selective sweep in a randomly mating population, the compound parameter α ≈ rs −1 ln(2N), where r is the per-basepair recombination rate, s is the selection coefficient and N is the population size [39]. In a partially selfing population such as domesticated rice, however, both effective recombination rate and population size are affected by selfing rate. While the rate of coalescence (and hence the effective population size) is at most doubled by the rate of selfing, the rate of recombination can be radically altered. An expression for effective recombination rate is r(1 − σ/[2 − σ]), where σ is the selfing rate [40]. For domesticated rice, estimates of selfing rates are typically ∼0.99 [13], resulting in a reduced recombination rate by approximately 10−3. If we assume 400 selective sweeps occurred in the rice genome since domestication (κ = 10−6), we estimate that α = 2 × 10−12 for indica. With r = 10−9 recombination events per generation per base pair and ln(2N) ≈ 10, this estimate of α corresponds to an unreasonably high estimate of a 5,000-fold fitness advantage. Substituting an effective recombination rate of 10−12 (corresponding to a reduced effective rate due to selfing), we find more reasonable values for the strength of selection for the selective sweeps, with s ≈ 5. This example illustrates how high selfing rates can amplify the signal of selection and contribute to the pattern of polymorphism in the rice genome. Comparing Models to Explain Patterns of Nucleotide Polymorphism in Rice Visually, it appears both the bottleneck plus sweeps model and the bottleneck plus migration model predict the site-frequency spectrum of domesticated rice better than the bottleneck model alone (Figure 4) or the pure selection model (unpublished data). To compare likelihoods and determine which model best fits the data, we used the Akaike information criterion (AIC) [41]. Since SNPs in our dataset are linked, we used a composite likelihood function and simulations to assign p-values to the observed AIC statistic (see Materials and Methods). For indica, the bottleneck plus sweeps model is significantly better than the neutral bottleneck model (Λ = −17.18, p Λ obs . Supporting Information Figure S1 Clustering of Oryza Accessions Based on Neighbor-Joining Analysis of Concatenated STS Sequences Numbers by branches are bootstraps of 1,000 replicates. Only branches with a bootstrap value higher than 60% for major clades (five or more accessions) are labeled. The monophyly of each rice variety group is well supported, with the exception of tropical japonica. The accession M202-new is an elite temperate japonica line that has been subjected to possible crosses with other groups, perhaps explaining its inclusion within the tropical japonica. (409 KB EPS) Click here for additional data file. Figure S2 Frequency Distribution of Tajima's D Values for All STS Sampled in (A) indica, (B) tropical japonica, and (C) O. rufipogon (373 KB EPS) Click here for additional data file. Figure S3 The Distribution of Allele Frequency in O. rufipogon for Derived Alleles That Are at High Frequency in indica or tropical japonica Notably, most alleles are at low to intermediate frequency in O. rufipogon, consistent with multiple selective sweeps in O. sativa, and discounting the possibility of misidentification of ancestral alleles or interspecific introgression being responsible for the pattern observed in rice. HFD, high-frequency derived SNPs. (390 KB EPS) Click here for additional data file. Figure S4 The Genomic Distribution of STS Fragments Contributing High-Frequency Derived SNPs in indica and tropical japonica In each group, high-frequency derived SNPs occur in ten of 12 rice chromosomes. Fragments containing high-frequency derived SNPs comprise a large portion of fragments containing any variation at all in each O. sativa group. In both rice groups, the sample of STS fragments used to construct the site-frequency spectrum is slightly lower than 111 due to missing data in O. meridionalis. (433 KB EPS) Click here for additional data file. Table S1 Oryza Accessions Used in the Study and Inferred Ancestry Coefficients (45 KB XLS) Click here for additional data file. Table S2 STS Fragment Information and Silent Sites Diversity Measures for the Various Oryza Groups (133 KB XLS) Click here for additional data file. Accession Numbers The National Center for Biotechnology Information GenBank (http://www.ncbi.nlm.nih.gov/Genbank) ID numbers for the sequences and alignments discussed in this article are EF000002–EF010509.
                Bookmark

                Author and article information

                Contributors
                Journal
                Front Plant Sci
                Front Plant Sci
                Front. Plant Sci.
                Frontiers in Plant Science
                Frontiers Media S.A.
                1664-462X
                22 June 2018
                2018
                : 9
                : 849
                Affiliations
                [1] 1International Rice Research Institute , Los Baños, Philippines
                [2] 2Institute of Crop Science, University of the Philippines Los Baños , Los Baños, Philippines
                [3] 3National Institute of Biotechnology and Genetic Engineering , Faisalabad, Pakistan
                [4] 4Data2Bio, LLC , Ames, IA, United States
                [5] 5Department of Agronomy, Iowa State University , Ames, IA, United States
                [6] 6Department of Plant Genetics and Breeding, China Agricultural University , Beijing, China
                [7] 7National Key Facility for Crop Gene Resources and Genetic Improvement, Institute of Crop Science, Chinese Academy of Agricultural Sciences , Beijing, China
                [8] 8Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences , Shenzhen, China
                Author notes

                Edited by: Petr Smýkal, Palacký University Olomouc, Czechia

                Reviewed by: Martin Mascher, Leibniz-Institut für Pflanzengenetik und Kulturpflanzenforschung, Germany; Umesh K. Reddy, West Virginia State University, United States

                *Correspondence: Jauhar Ali, j.ali@ 123456irri.org

                These authors have contributed equally to this work.

                These authors have also contributed equally to this work.

                This article was submitted to Plant Breeding, a section of the journal Frontiers in Plant Science

                Article
                10.3389/fpls.2018.00849
                6024854
                bd0d2c49-32e1-4ec3-a450-d1f270fe6a8e
                Copyright © 2018 Ali, Aslam, Tariq, Murugaiyan, Schnable, Li, Marfori-Nazarea, Hernandez, Arif, Xu and Li.

                This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

                History
                : 28 September 2017
                : 31 May 2018
                Page count
                Figures: 6, Tables: 1, Equations: 0, References: 43, Pages: 10, Words: 0
                Funding
                Funded by: Bill and Melinda Gates Foundation 10.13039/100000865
                Award ID: ID OPP1130530
                Categories
                Plant Science
                Original Research

                Plant science & Botany
                snp-typing,tunable genotyping by sequencing (tgbs),conventional genotyping by sequencing (cgbs),introgression breeding,non-synonymous snps,marker-assisted breeding

                Comments

                Comment on this article