Inviting an author to review:
Find an author and click ‘Invite to review selected article’ near their name.
Search for authorsSearch for similar articles
91
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      Genomic Evidence of Rapid and Stable Adaptive Oscillations over Seasonal Time Scales in Drosophila

      research-article

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          In many species, genomic data have revealed pervasive adaptive evolution indicated by the fixation of beneficial alleles. However, when selection pressures are highly variable along a species' range or through time adaptive alleles may persist at intermediate frequencies for long periods. So called “balanced polymorphisms” have long been understood to be an important component of standing genetic variation, yet direct evidence of the strength of balancing selection and the stability and prevalence of balanced polymorphisms has remained elusive. We hypothesized that environmental fluctuations among seasons in a North American orchard would impose temporally variable selection on Drosophila melanogaster that would drive repeatable adaptive oscillations at balanced polymorphisms. We identified hundreds of polymorphisms whose frequency oscillates among seasons and argue that these loci are subject to strong, temporally variable selection. We show that these polymorphisms respond to acute and persistent changes in climate and are associated in predictable ways with seasonally variable phenotypes. In addition, our results suggest that adaptively oscillating polymorphisms are likely millions of years old, with some possibly predating the divergence between D. melanogaster and D. simulans. Taken together, our results are consistent with a model of balancing selection wherein rapid temporal fluctuations in climate over generational time promotes adaptive genetic diversity at loci underlying polygenic variation in fitness related phenotypes.

          Author Summary

          Herein, we investigate the genomic basis of rapid adaptive evolution in response to seasonal fluctuations in the environment. We identify hundreds of polymorphisms (seasonal SNPs) that undergo dramatic shifts in allele frequency – on average between 40 and 60% – and oscillate between seasons repeatedly over multiple years, likely inducing high levels of genome-wide genetic differentiation. We provide evidence that seasonal SNPs are functional, being both sensitive to an acute frost event and associated with two stress tolerance traits. Finally, we show that some seasonal SNPs are possibly ancient balanced polymorphisms. Taken together, our results suggest that environmental heterogeneity can promote the long-term persistence of functional polymorphisms within populations that fuels fast directional adaptive response at any one time.

          Related collections

          Most cited references58

          • Record: found
          • Abstract: found
          • Article: not found

          Population Genomics: Whole-Genome Analysis of Polymorphism and Divergence in Drosophila simulans

          Introduction Given the long history of Drosophila as a central model system in evolutionary genetics beginning with the origins of empirical population genetics in the 1930s, it is unsurprising that Drosophila data have inspired the development of methods to test population genetic theories using DNA variation within and between closely related species [1–4]. These methods rest on the supposition of the neutral theory of molecular evolution that polymorphism and divergence are manifestations of mutation and genetic drift of neutral variants at different time scales [5]. Under neutrality, polymorphism is a “snapshot” of variation, some of which ultimately contributes to species divergence as a result of fixation by genetic drift. Natural selection, however, may cause functionally important variants to rapidly increase or decrease in frequency, resulting in patterns of polymorphism and divergence that deviate from neutral expectations [1,2,6]. A powerful aspect of inferring evolutionary mechanism in this population genetic context is that selection on sequence variants with miniscule fitness effects, which would be difficult or impossible to study in nature or in the laboratory but are evolutionarily important, may cause detectable deviations from neutral predictions. Another notable aspect of these population genetic approaches is that they facilitate inferences about recent selection—which may be manifest as reduced polymorphism or elevated linkage disequilibrium—or about selection that has occurred in the distant past—which may be manifest as unexpectedly high levels of divergence. The application of these conceptual advances to the study of variation in closely related species has resulted in several fundamental advances in our understanding of the relative contributions of mutation, genetic drift, recombination, and natural selection to sequence variation. However, it is also clear that our genomic understanding of population genetics has been hobbled by fragmentary and nonrandom population genetic sampling of genomes. Thus, the full value of genome annotation has not yet been applied to the study of population genetic mechanisms. Combining whole-genome studies of genetic variation within and between closely related species (i.e., population genomics) with high-quality genome annotation offers several major advantages. For example, we have known for more than a decade that regions of the genome experiencing reduced crossing over in Drosophila tend to show reduced levels of polymorphism yet normal levels of divergence between species [7–10]. This pattern can only result from natural selection reducing levels of polymorphism at linked neutral sites, because it violates the neutral theory prediction of a strong positive correlation between polymorphism and divergence [5]. However, we have no general genomic description of the physical scale of variation in polymorphism and divergence in Drosophila and how such variation might be related to variation in mutation rates, recombination rates, gene density, natural selection, or other factors. Similarly, although several Drosophila genes have been targets of molecular population genetic analysis, in many cases, these genes were not randomly chosen but were targeted because of their putative association with phenotypes thought to have a history of adaptive evolution [11,12]. Such biased data make it difficult to estimate the proportion of proteins diverging under adaptive evolution. In a similar vein, the unique power of molecular population genetic analysis, when used in concert with genome annotation, could fundamentally alter our notions about phenotypic divergence due to natural selection. This is because our current understanding of phenotypic divergence and its causes is based on a small and necessarily highly biased description of phenotypic variation. Alternatively, a comprehensive genomic investigation of adaptive divergence could use genome annotations to reveal large numbers of new biological processes previously unsuspected of having diverged under selection. Here we present a population genomic analysis of D. simulans. D. simulans and D. melanogaster are closely related and split from the outgroup species, D. yakuba, several million years ago [13–15]. The vast majority of D. simulans and D. yakuba euchromatic DNA is readily aligned to D. melanogaster, which permits direct use of D. melanogaster annotation for investigation of polymorphism and divergence and allows reliable inference of D. simulans–D. melanogaster ancestral states over much of the genome. Our analysis uses a draft version of a D. yakuba genome assembly (aligned to the D. melanogaster reference sequence) and a set of light-coverage, whole-genome shotgun data from multiple inbred lines of D. simulans, which were syntenically aligned to the D. melanogaster reference sequence. Results/Discussion Genomes and Assemblies Seven lines of D. simulans and one line of D. yakuba were sequenced at the Washington University Genome Sequencing Center (the white paper can be found at http://www.genome.gov/11008080). The D. simulans lines were selected to capture variation in populations from putatively ancestral geographic regions [16], recent cosmopolitan populations, and strains encompassing the three highly diverged mitochondrial haplotypes previously described for the species [17]. These strains have been deposited at the Tucson Drosophila Stock Center (http://stockcenter.arl.arizona.edu). A total of 2,424,141 D. simulans traces and 2,245,197 D. yakuba traces from this project have been deposited in the National Center for Biotechnology Information (NCBI) trace archive. D. simulans syntenic assemblies were created by aligning trimmed, uniquely mapped sequence traces from each D. simulans strain to the euchromatic D. melanogaster reference sequence (v4). Two strains from the same population, sim4 and sim6, were unintentionally mixed prior to library construction; reads from these strains were combined to generate a single, deeper, syntenic assembly (see Materials and Methods), which is referred to as SIM4/6. The other strains investigated are referred to as C167.4, MD106TS, MD199S, NC48S, and w501 . Thus, six (rather than seven) D. simulans syntenic assemblies are the objects of analysis. Details on the fly strains and procedures used to create these assemblies, including the use of sequence quality scores, can be found in Materials and Methods. The coverages (in Mbp) for C167.4, MD106TS, MD199S, NC48S, SIM4/6, and w501 , are 56.9, 56.3, 63.4, 42.6, 89.8, and 84.8, respectively. A D. yakuba strain Tai18E2 whole-genome shotgun assembly (v2.0; http://genome.wustl.edu/) generated by the Parallel Contig Assembly Program (PCAP) [18] was aligned to the D. melanogaster reference sequence (Materials and Methods). The main use of the D. yakuba assembly was to infer states of the D. simulans–D. melanogaster ancestor. For many analyses, we used divergence estimates for the D. simulans lineage or the D. melanogaster lineage (from the inferred D. simulans–D. melanogaster ancestor) rather than the pairwise (i.e., unpolarized) divergence between these species. These lineage-specific estimates are often referred to as “D. simulans divergence,” “D. melanogaster divergence,” or “polarized divergence.” A total of 393,951,345 D. simulans base pairs and 102,574,197 D. yakuba base pairs were syntenically aligned to the D. melanogaster reference sequence. Several tens of kilobases of repeat-rich sequences near the telomeres and centromeres of each chromosome arm were excluded from our analyses (Materials and Methods). D. simulans genes were conservatively filtered for analysis based on conserved physical organization and reading frame with respect to the D. melanogaster reference sequence gene models (Materials and Methods). We took this conservative approach so as to retain only the highest quality D. simulans data for most inferences. The number of D. simulans genes remaining after filtering was 11,466. Ninety-eight percent of coding sequence (CDS) nucleotides from this gene set are covered by at least one D. simulans allele. The average number of lines sequenced per aligned D. simulans base was 3.90. For several analyses in which heterozygosity and divergence per site were estimated, we further filtered the data so as to retain only genes or functional elements (e.g., untranslated regions [UTRs]) for which the total number of bases sequenced across all lines exceeded an arbitrary threshold (see Materials and Methods). The numbers of genes for which we estimated coding region expected heterozygosity, unpolarized divergence, and polarized divergence were 11,403, 11,439, and 10,150, respectively. Coverage on the X chromosome was slightly lower than autosomal coverage, which is consistent with less X chromosome DNA than autosomal DNA in mixed-sex DNA preps. Variable coverage required analysis of individual coverage classes (n = 1–6) for a given region or feature, followed by estimation and inference weighted by coverage (Materials and Methods). The D. simulans syntenic alignments are available at http://www.dpgp.org/. An alternative D. simulans “mosaic” assembly, which is available at http://www.genome.wustl.edu/, was created independently of the D. melanogaster reference sequence. General Patterns of Polymorphism and Divergence Nucleotide variation. We observed 2,965,987 polymorphic nucleotides, of which 43,878 altered the amino acid sequence; 77% of sampled D. simulans genes were segregating at least one amino acid polymorphism. The average, expected nucleotide heterozygosity (hereafter, “heterozygosity” or “πnt”) for the X chromosome and autosomes was 0.0135 and 0.0180, respectively. X chromosome πnt was not significantly different from that of the autosomes (after multiplying X chromosome πnt by 4/3, to correct for X/autosome effective population size differences when there are equal numbers of males and females; see [19]). However, X chromosome divergence was greater than autosomal divergence in all three lineages (50-kb windows; Table 1, Table S1, Figure 1, Dataset S8). We will discuss this pattern in greater detail below. Table 1 Autosome and X Chromosome Weighted Averages of Nucleotide Heterozygosity (π) and Lineage Divergence Figure 1 Patterns of Polymorphism and Divergence of Nucleotides along Chromosome Arms Nucleotide π (blue) and div on the D. simulans lineage (red) in 150-kbp windows are plotted every 10 kbp. χ[–log(p)] (olive) as a measure of deviation (+ or –) in the proportion of polymorphic sites in 30-kbp windows is plotted every 10 kbp (see Materials and Methods). C and T correspond to locations of centromeres and telomeres, respectively. Chromosome arm 3R coordinates correspond to D. simulans locations after accounting for fixed inversion on the D. melanogaster lineage. Not surprisingly, many patterns of molecular evolution identified from previously published datasets were confirmed in this genomic analysis. For example, synonymous sites and nonsynonymous sites were the fastest and slowest evolving sites types, respectively [20–24]. Nonsynonymous divergence (dN) and synonymous divergence (dS) were positively, though weakly, correlated (r 2 = 0.052, p 6 for each of polymorphisms, fixations, synonymous variants, and nonsynonymous variants (Dataset S1). The filtered data set of unpolarized MK tests contained 6,702 genes, of which 1,270 (19%) were significant (in the direction of adaptive evolution) at the 0.05 critical value and 539 (8%) genes were significant at a 0.01 critical value. Given that MK tests can only detect directional selection when multiple beneficial mutations have fixed, these results provide a conservative view of the prevalence of adaptive protein divergence. There was a slight enrichment of significant unpolarized MK tests on the autosomes relative to the X chromosome (Fisher's Exact test, p = 0.0014). However, conclusions regarding the incidence of directional selection on autosomes versus the X chromosome should be tempered by the fact that the average numbers of polymorphic and fixed variants per gene may differ between the two types of chromosomes, which affects the power of the MK test to reject neutrality. We observed no enrichment of significant tests in regions of the X chromosome hypothesized to experience greater versus lower rates of crossing over. Several of the most highly significant MK test statistics are from genes with known functions and in many cases, known names and mutant phenotypes. More generally, among the genes with no associated GO term, a smaller proportion had significant unpolarized MK tests compared to the proportion for genes associated with one or more GO terms (0.16 versus 0.20, p = 3 × 10−5). Included among the most highly significant genes in the unpolarized MK tests (Table S12) were several with reproduction-related functions. For example, the sperm of males carrying mutations in Pkd2 (CG6504), the gene with the smallest MK p-value in the genome, are not properly stored in females, suggesting sperm–female interactions (perhaps associated with sperm competition) as a possible agent of selection [92,93]. Other examples include Nc (CG8091), which plays a role in sperm individualization [94]; Acxc (CG5983), a sperm-specific adenylate cyclase [95]; and Dhc16F (CG7092), which is a component of the axonemal dynein complex (suggesting a possible role of selection on sperm motility). For polarized MK tests, we used the D. yakuba genome to infer which fixed differences between D. simulans and D. melanogaster occurred along the D. simulans lineage (Materials and Methods). These fixations were then compared to D. simulans polymorphisms. This reduced, filtered dataset contained 2,676 genes of which 384 (14%) and 169 (6%) were significant at the 0.05 and 0.01 levels, respectively (deviating in the direction of adaptive evolution; Datasets S1). Twenty-three genes showed evidence of a significant (p 10 genes include nuclear envelope, nuclear pore, amino acid-polyamine transporter activity, ubiquitin-specific protease activity, protein deubiquitination, and protein import into the nucleus. Results from a comparable analysis of D. melanogaster protein evolution are shown in Table S21. Using the same criteria of n > 10 genes and p 2) indicative of directional selection on 5′ UTRs, 3′ UTRs, and intron sequences, respectively. Among the most unusual 5′UTRs are those associated with genes coding for proteins associated with the cytoskeleton or the chromosome, categories that also appeared as unusual in the MK tests on protein variation. Two of the top-ten 3′ UTRs are associated with the SAGA complex, a multi-subunit transcription factor involved in recruitment of RNA Pol II to the chromosome [111]. Among the extreme introns, two are from genes coding for components of the ABC transporter complex and two are from genes coding for centrosomal proteins, again pointing to the unusual evolution of genes associated with the cytoskeleton and chromosome structure and movement. As previously noted, a large number of significant UTRs deviate in the direction of excess polymorphism (relative to synonymous mutations). Given the potential importance of the UTRs in regulating transcript abundance and localization, translational control, and as targets of regulatory microRNAs [112], such UTRs could be attractive candidates for functional investigation. Contingency tests of significant versus nonsignificant MK test for amino acids versus each of the noncoding elements yielded p-values of 0.65, 0.04, and 0.07 for 5′ UTRs, 3′ UTRs, and introns, respectively. Thus, there is weak evidence that genes under directional selection on amino acid sequences tend to have 3′ UTRs and introns influenced by directional selection as well. Whole-Genome Analysis of Polymorphic and Fixed Variants Up to this point, our analyses have investigated various attributes of polymorphism and divergence based on windows or genes. An alternative approach for understanding the causes of variation and divergence is to analyze polymorphism and divergence across site types. Table 2 shows whole-genome counts of polymorphic and polarized fixed variants for UTRs, synonymous sites, nonsynonymous sites, introns, and intergenic sites. We also provide data for polarized, synonymous preferred or unpreferred variants. Almost all preferred versus unpreferred codons in D. melanogaster end in GC versus AT, respectively [113]; thus, preferred versus unpreferred codons can be thought of as GC-ending versus AT-ending codons. Table 2 Whole-Genome Counts of Polarized Polymorphic and Fixed Variants Nonsynonymous sites showed the smallest ratio of polymorphic-to-fixed variants, which is consistent with the MK tests and supports the idea that such sites are the most likely to be under directional selection. Nonsynonymous polymorphisms also occur at slightly lower frequency than do noncoding variants (Table S25). Synonymous sites have the highest ratio of polymorphic-to-fixed variants, which supports the previously documented elevated ratio of polymorphic-to-fixed unpreferred synonymous variants in D. simulans [89]. The confidence intervals of the ratio of polymorphic-to-fixed variants among site types are nonoverlapping with the exception of intron and intergenic sites. If preferred synonymous mutations are, on average, beneficial [89,114], then the smaller polymorphic-to-fixed ratio for nonsynonymous and UTR variants versus preferred variants implies that a large proportion of new nonsynonymous and UTR mutations are beneficial. Using similar reasoning, the data in Table 2 suggest that directional selection plays a larger role in nonsynonymous and UTR divergence compared to intron and intergenic divergence [20,115,116]. These conclusions are consistent with estimates of α [11,117], the proportion of sites fixing under directional selection (assuming that synonymous sites are neutral and at equilibrium) for different site types. Base Composition Evolution Determining the relative contributions of various mutational and population genetic processes to base composition variation and inferring the biological basis of selection on base composition remain difficult problems. Much of the previously published data on base composition variation in D. simulans have been from synonymous sites [55,89,90,118]. Several lines of evidence [55,89,90,113,118] suggest that on average, preferred codons have higher fitness than unpreferred codons, with variation in codon usage being maintained by AT-biased mutation, weak selection against unpreferred codons, and genetic drift [23,114]. However, the possibilities of nonequilibrium mutational processes and/or natural selection favoring different base composition in different lineages have also been addressed [119]. The D. simulans population genomic data allow for a thorough investigation of the population genetics and evolution of base composition for both coding and noncoding DNA [59,120]. The analyses discussed below use parsimony to polarize polymorphic and fixed variants. Complete genomic and gene-based data are available as Datasets S9 and S10. Synonymous sites. Previous reports suggested that D. simulans synonymous sites are evolving towards higher AT content, although the excess of AT over GC fixations is small [55]. That trend was confirmed in this larger dataset; there are many more ancestral preferred codons that have fixed an unpreferred codon (coverage classes four–six, n = 21,156) in D. simulans compared with ancestral unpreferred codons that have fixed a preferred codon (coverage classes four–six, n = 15,409). Furthermore, the population genomic data also support previous reports [89] that D. melanogaster synonymous sites are becoming AT-rich at a faster rate than D. simulans synonymous sites (Table S26), contributing to the higher median dS in D. melanogaster (0.069) compared to D. simulans (0.051, Wilcoxon Signed Rank, p 200 bp. Expected heterozygosity was also estimated for genomic features (exons, introns, UTRs, and intergenic sequence) that had a minimum size and coverage [i.e., n(n – 1) × s ≥ 100, where n = average number of alleles sampled and s = number of sites]. For coding regions, the numbers of silent and replacement sites were counted using the method of Nei and Gojobori [129]. The pathway between two codons was calculated as the average number of silent and replacement changes from all possible paths between the pair. The variance of pairwise differences in sliding windows (150-kb windows, 10-kb increments) was used as a method of summarizing the magnitude of linkage disequilibrium across the D. simulans genome. For each window, we calculated coverage weighted variance of the expected heterozygosity (see above) for all pairs of alleles. Divergence. Unpolarized (i.e., pairwise) divergence between D. simulans and D. melanogaster was estimated for 10-kb windows, 50-kb windows, 30-kb sliding windows (10-kb increments), 150-kb sliding windows (10-kb increments), 210-kb windows (10-kb increments), and genomic feature that had a minimum number of nucleotides represented [i.e., n × s > 100, with n and s as above in calculations of π. Unpolarized divergence was calculated as the average pairwise divergence at each site, which was then summed over sites and divided by the total number of sites. A Jukes-Cantor [130] correction was applied to account for multiple hits. For coding regions, the numbers of silent and replacement sites were counted using the method of Nei and Gojobori [129]. The pathway between two codons was calculated as the average number of silent and replacement changes from all possible paths between the pair. Estimates of unpolarized divergence over chromosome arms were calculated for each feature with averages weighted by the number of sites per feature. Lineage-specific divergence was estimated by maximum likelihood using PAML v3.14 [131] and was reported as a weighted average over each line with greater than 50 aligned sites in the segment being analyzed. Maximum likelihood estimates of divergence were calculated over 10-kb windows, 50-kb windows, 30-kb sliding windows (10-kb increments), 150-kb sliding windows (10-kb increments), 210-kb windows (10-kb increments), and gene features (exons, introns, and UTRs). PAML was run in batch mode using a BioPerl wrapper [132]. For noncoding regions and windows, we used baseml with HKY as the model of evolution to account for transition/transversion bias and unequal base frequencies [133]; for coding regions, we used codeml with codon frequencies estimated from the data. Insertion and deletion divergence was calculated as divi , the coverage-weighted average divergence of deletions (i = ▵) or insertions (i = ▿) per base pair. where nc is the number of aligned base pairs in the genomic region (e.g., gene feature or window) with sequencing coverage c. kcj is the number of sites in this region with coverage c at which the derived state with respect to the D. melanogaster reference sequence (▵ or ▿) occurs in j out of the c sequences. MK tests (unpolarized and polarized). Unpolarized MK tests [4] used D. simulans polymorphism data and the D. melanogaster reference sequence for counting fixed differences. Polarized MK tests used D. yakuba to infer the D. simulans/D. melanogaster ancestral state. For both polarized and unpolarized analyses, we took the conservative approach of retaining for analysis only codons for which there were no more than two alternative states. For cases in which two alternative codons differed at more than one position, we used the pathway between codons that minimized the number of nonsynonymous substitutions. This is conservative with respect to the alternative hypothesis of adaptive evolution. Polymorphic codons at which one of the D. simulans codons was not identical to the D. melanogaster codon were not included. To reduce multiple testing problems, we filtered the data to retain for further analysis only genes that exceeded a minimum number of observations; we required that each row and column in the 2 × 2 table (two variant types and polymorphic versus fixed) sum to six or greater. Statistical significance of 2 × 2 contingency tables was determined by Fisher's Exact test. MK tests were also carried out for introns and Gold Collection UTRs by comparing synonymous variants in the associated genes with variants in these functional elements. For intergenic MK tests, we used synonymous variants from genes within 5 kb of the 5′ and/or 3′boundary of the intergenic segment. For some analyses, we restricted our attention to MK tests that rejected the null in the direction of adaptive evolution. This categorization was determined following Rand and Kann [134]. Polarized 2 × 2 contingency tables were used to calculate α, which under some circumstances can be thought of as an estimate of the proportion of variants fixing under selection [11]. Bootstrap confidence intervals of α and of the ratio of polymorphic-to-fixed variants for each functional element (Table 2) were estimated in R using bias correction and acceleration [135]. Rate variation. Our approach takes overall rate variation among lineages into account when generating expected numbers of substitutions under the null model and allows for different rates of evolution among chromosome arms (e.g., a faster-X effect). For example, the number of substitutions for all X-linked 50-kb windows was estimated using PAML (baseml), allowing different rates for each lineage. All D. simulans lines were used, with the estimated substitution D. simulans rate for each window being the coverage-weighted average. This generated an empirically determined branch length of the X chromosome for the average over each of the D. simulans lines (from all three way comparisons with D. melanogaster and D. yakuba) weighted by the number of bases covered. We carried out a relative rate test for windows or features in D. simulans and D. melanogaster by generating the expected number of substitutions for each window/feature/lineage based on the branch length of the entire chromosome in each lineage (PAML) and the coverage of the window/feature in question in each lineage. We then calculated the deviation from the expected number of substitutions as (observed – expected substitutions)2/expected substitutions for any window/feature/lineage. GO by MK permutations. For each GO term associated with at least five MK tests, we calculated the proportion of significant (p Genes Located in Genomics Regions Showing Disproportionate Reductions of Nucleotide Heterozygosity in the US Sample (68 KB DOC) Click here for additional data file. Table S11 GO Terms Overrepresented in Windows from Out-of-Africa/Madagascar Analysis. MF and BP, molecular function and biological process, respectively (50 KB DOC) Click here for additional data file. Table S12 GO Terms Associated with the Top 20 Genes with the Smallest Unpolarized MK Test p-Value (118 KB DOC) Click here for additional data file. Table S13 Genes Showing Excess Protein Polymorphism (p 2) (55 KB DOC) Click here for additional data file. Table S23 Genes Associated with the Most-Significant 3′ UTR Polarized MK Tests (Average Coverage per Site > 2) (52 KB DOC) Click here for additional data file. Table S24 Genes Associated with the Most-Significant Intron MK Tests (Average Coverage per Site > 2) (64 KB DOC) Click here for additional data file. Table S25 Number (Frequency) of Nonsynonymous and Noncoding Polymorphisms (Sites with Coverage of n = 5 or n = 6 D. simulans Alleles) for Different Frequency Classes (40 KB DOC) Click here for additional data file. Table S26 Counts and Substitution Rates per Site of Preferred and Unpreferred Variants “Fixed” along the D. simulans and D. melanogaster Lineages (Inferred by Parsimony) Substitution rates were determined by dividing the number of preferred/unpreferred fixations by the number of unpreferred/preferred ancestral bases. (74 KB DOC) Click here for additional data file. Table S27 X and A, Polymorphic and Fixed, Preferred and Unpreferred Variants for Sites with Coverages Four, Five, or Six (33 KB DOC) Click here for additional data file. Table S28 Unpreferred Polymorphisms (Coverage Five Sites) Occur at Lower Frequency than Preferred Polymorphisms (30 KB DOC) Click here for additional data file. Table S29 Genes with Significant Polarized MK Tests Have a Higher Proportion of Preferred Fixations than Genes with Nonsignificant MK Tests (27 KB DOC) Click here for additional data file. Table S30 Preferred, Unpreferred, and Noncoding GC/AT Fixed Variants across the Genome (Coverage Classes Three–Six) (27 KB DOC) Click here for additional data file. Table S31 Polymorphic GC Variants Occur at Higher Frequency than Polymorphic AT Variants X-linked polymorphic GC variants occur at higher frequency than autosomal polymorphic GC variants (coverage-six polymorphisms from intergenic and intron DNA). (32 KB DOC) Click here for additional data file. Table S32 D. yakuba Genome Input and Assembly Statistics Statistics presented are for the whole-genome assembly before it was anchored using alignments to D. melanogaster. “Contigs” are contiguous sequences not interrupted by gaps, and “supercontigs” are ordered and oriented “contigs” including estimated gap sizes. The N50 statistic is defined as the largest length L such that 50% of all nucleotides are contained in contigs of size at least L. The total contig size was 167 Mb, with 97% of the consensus base pairs having quality scores of at least 40 (Q40) (expected error rate of less than or equal to 10−4) and 98% are at least Q20. (59 KB DOC) Click here for additional data file. Table S33 Read and Trim Statistics for D. simulans Syntenic Assemblies (35 KB DOC) Click here for additional data file. Table S34 Correlation (Kendall's τ) between Copy Numbers of TE Families in “Trimmed” Euchromatic Regions of D. simulans and D. melanogaster The simulans TEs are the “clustered” TEs. The melanogaster TEs are those annotated in release 4.0. (31 KB DOC) Click here for additional data file. Table S35 Tests of the Homogeneity of the Proportions of Each Family across Six D. simulans Lines, Homogeneity of Classes across Lines, and Homogeneity of Families within Classes across Lines (33 KB DOC) Click here for additional data file. Table S36 Test of the Homogeneity of Relative Family Copy Numbers across the Five Chromosome Arms (Pooled across Lines) for All TEs and within the Four Classes (33 KB DOC) Click here for additional data file. Table S37 Test of the Homogeneity of Relative Family Copy Numbers on the X chromosome versus the Autosomes (Pooled across Lines) for All TEs and within the Four Classes (32 KB DOC) Click here for additional data file. Table S38 Heterogeneity of “Cloned” TE Numbers in Various Gene Annotation Elements (29 KB DOC) Click here for additional data file. Table S39 Comparison of Expected D. simulans Nucleotide Heterozygosity and Divergence for 30-kb Windows Centered on the Estimated Position of “Clustered” TEs (+) Compared to Windows without Clustered TEs (–) The difference between the distributions (TEs: +/-) was tested with the Mann-Whitney U test; the p-value is in the upper position in the last column (probability < / ratio). The ratio of the means is also shown (lower in last column). (50 KB DOC) Click here for additional data file. Text S1 Transposable Elements (48 KB DOC) Click here for additional data file. Accession Numbers The GenBank (http://www.ncbi.nlm.nih.gov/Genbank/) accession number for D. yakuba is AAEU01000000 (version 1) and for the D. simulans w501 whole-genome shotgun assembly is TBS-AAEU01000000 (version 1).
            Bookmark
            • Record: found
            • Abstract: found
            • Article: not found

            Levels of naturally occurring DNA polymorphism correlate with recombination rates in D. melanogaster.

            Two genomic regions with unusually low recombination rates in Drosophila melanogaster have normal levels of divergence but greatly reduced nucleotide diversity, apparently resulting from the fixation of advantageous mutations and the associated hitch-hiking effect. Here we show that for 20 gene regions from across the genome, the amount of nucleotide diversity in natural populations of D. melanogaster is positively correlated with the regional rate of recombination. This cannot be explained by variation in mutation rates and/or functional constraint, because we observe no correlation between recombination rates and DNA sequence divergence between D. melanogaster and its sibling species, D. simulans. We suggest that the correlation may result from genetic hitch-hiking associated with the fixation of advantageous mutants. Hitch-hiking thus seems to occur over a large fraction of the Drosophila genome and may constitute a major constraint on levels of genetic variation in nature.
              Bookmark
              • Record: found
              • Abstract: found
              • Article: found
              Is Open Access

              The Many Landscapes of Recombination in Drosophila melanogaster

              Introduction Recombination is a fundamental biological process. Mechanistically, most sexual eukaryotes require recombination between homologous chromosomes for proper formation of haploid gametes from diploid germ cells [1]. Evolutionarily, recombination is predicted to increase the effectiveness of selection in natural populations, thus explaining the pervasiveness of recombination and sex [2]–[9]. In most species, recombination rates also vary among and along chromosomes. This intra-genomic variation provides a unique opportunity to evaluate the evolutionary consequences of recombination due to exposure to identical demographic and environmental factors; an all else being equal premise that is hardly ever warranted when comparing populations or species. Population genetic analyses across genomes have confirmed the profound effects that recombination imposes on the evolutionary process, shaping levels of genetic variation, limiting the accumulation of deleterious mutations and enhancing rates of adaptation in many species including Drosophila, Arabidopsis thaliana, Saccharomyces cerevisiae and nematodes [10]–[25]. Yet these studies have been hindered, at least in part, by three limitations associated with the lack of detailed characterization of natural variation in recombination across genomes and within species. First, population genomic analyses can now analyze patterns of selection at the scale of single genes, or even focus on specific gene regions, but most whole-genome genetic maps localize recombination events with much less detail [24], [26]–[31]. Second, most species' maps based on direct measures of recombination are obtained either from two specific, usually highly diverged, strains (e.g., Caenorhabditis elegans [30] or S. cerevisiae [32]), or compiled from crosses in different laboratory/natural conditions that can also influence recombination rates [13], [26], [28]. These genetic maps are customarily assumed to represent a monomorphic description of recombination for a given species, even when there is ample evidence of intra-specific natural variation. In species like A. thaliana, D. melanogaster, C. elegans, maize, mice or humans, variation has been reported for the number of recombination events at the level of whole chromosomes [28], [30], [33]–[35] or for specific chromosomal intervals [36]–[44], but a high-resolution whole-genome study of variation in genetic maps based on multiple natural strains is lacking. Third, the process of meiotic recombination associated with the repair of double strand breaks (DSB) has two possible outcomes with diverse evolutionary consequences: crossing over (CO) and non-crossing over (or gene conversion; GC). Although CO events also include a gene conversion tract, we will use GC to refer to recombination events that are not accompanied by crossing over (Figure S1). Unlike CO, GC results in the exchange of only small tracts of a chromosome, interrupting linkage disequilibrium in a much localized manner while having no effect on linkage disequilibrium at longer intervals, thus GC plays a stronger role than CO at short physical distances [45]–[48]. Corresponding high-resolution CO and GC maps based on direct experimental detection of recombination events however are only available for the unicellular yeast Saccharomyces cerevisiae [32]. In multicellular organisms, CO maps are used as proxy for total recombination, and the consequences of GC are overlooked, despite its potential influence on total recombination, particularly in regions or chromosomes with limited CO. To alleviate all these deficiencies, we generated high-resolution, whole-genome CO and GC maps from eight crosses between natural strains of D. melanogaster (see Materials and Methods for details). In order to obtain haploid genomes resulting from female meioses, we crossed heterozygous D. melanogaster females to males of D. simulans and genotyped the female hybrid progeny using whole-genome high-throughput Illumina sequencing. Reads corresponding to D. simulans were removed bioinformatically by mapping them to D.simulans genomic sequences. We then generated a whole-genome D. melanogaster haplotype per genotyped fly by mapping high-quality informative SNPs to one of the two parental D. melanogaster strains. Recombination events were detected, and directly assigned as CO or GC events, based on changes in parental identity along each D. melanogaster haplotype. Overall, we characterized the products of 5,860 female meioses and genotyped an average of 49,000 informative SNPs per fly, for a total of 139 million SNPs. We mapped more than 106,000 recombination events (CO and GC combined) with a median distance to the nearest informative SNP of less than 2.0 kb (1.83 kb). This resolution is almost equivalent to the high-resolution mapping of meiotic recombination in the unicellular S. cerevisiae [32], 15-fold higher than the linkage map in A. thaliana also based on recombinant inbred lines [27], and more than 50-fold more detailed than current high-resolution whole-genome CO maps in humans [28], C. elegans [30], C. briggsae [24], or D. pseudoobscura [29]. Results A high-resolution CO map for D. melanogaster Combining the results from all crosses we detected a total of 32,511 CO events that were used to generate high-resolution CO maps in D. melanogaster (Figure 1). Due to the elevated density of markers and the small number of CO events per chromosome and genotyped fly, each CO is supported by many contiguous markers at either side and it is our expectation that we have detected all COs. The total genetic map length for D. melanogaster obtained in our crosses is 287.3 cM, closely matching classical measures (282 cM [26]). A low-resolution approximation to the distribution of CO rates (c) along chromosome arms based on our data (Figure S2) recovers the same general, large-scale distribution as previous maps based on visible markers [11]–[13], [26], [49]–[53]. As expected, c is sharply reduced near telomeres and centromeres, and we detect no CO events in the small fourth (dot) chromosome that proceeds to meiotic segregation without chiasmata [54]. 10.1371/journal.pgen.1002905.g001 Figure 1 Crossing over rate variation along chromosome arms in D. melanogaster. Rate of crossing over (c) based on data from all crosses and indicated in centimorgans (cM) per megabase (Mb) per female meiosis (red line). c is shown along chromosomes for 100-kb windows and a movement between adjacent windows of 50 kb. Blue lines indicate 90% confidence interval for c at each window. Our detailed maps deepen the recent appreciation for intra-chromosomal variation in CO rates in Drosophila [29], [55], [56] and outline this heterogeneity at a much finer scale across the whole genome. Heterogeneity in CO rates along each chromosome is significant at all physical scales analyzed, from 100 kb to 10 Mb, even after removing centromeric and telomeric regions with visibly reduced rates (P 40-fold) relative to either adjacent regions or to other crosses (Figure 2). As expected, crosses sharing one parental strain have more similar maps than crosses not sharing parental strains but the overall magnitude of the correlation between these crosses, albeit significant, is rather small (Spearman's R = +0.451). This observation reinforces the concept of a highly polygenic and polymorphic basis for CO distribution along chromosomes. 10.1371/journal.pgen.1002905.g002 Figure 2 Intra-specific variation in crossing over rates along chromosome arms. c (cM/Mb per female meiosis) for eight different crosses (different colors) and shown for adjacent 250-kb windows. To quantify variation in CO rates among the eight CO maps we estimated the variance to mean ratio (Index of Dispersion; R CO) and tested whether the different number of CO events at a given region can be explained by a Poisson process. Moreover, we focused on variation in the distribution of CO rates along chromosomes and therefore we took into account the number of total events for each chromosome (see Materials and Methods for details). Our study of R CO along chromosomes reveals many regions (107 or 22% of all non-overlapping 250-kb regions across the genome) with a variance among crosses larger than expected (overdispersion) and this pattern is observed in all chromosomes (Figure 3). The magnitude of this excess variance is highest for chromosome arm 2L while notably reduced for the chromosome arm 3L. Significant overdispersion of CO rates among crosses is also detected when we study larger genomic regions. At a physical scale of 1 Mb, more than half of the genomic regions exhibit excess variance, thus suggesting that regions with variable CO rates are frequent enough across the D. melanogaster genome to be playing a detectable role in a large fraction of these longer sequences. 10.1371/journal.pgen.1002905.g003 Figure 3 Estimate of the Index of Dispersion (R CO) for rates of crossing over along chromosome arms. R CO was obtained by comparing crossing over rates from eight crosses (see Materials and Methods for details) and is shown for adjacent 250-kb windows (blue line). The doted red line indicates the P = 0.0005 confidence threshold (equivalent to P ( = 0.05)/number of windows in whole-genome analyses). These results are consistent with early studies in Drosophila that reported natural variation in CO rates based on artificial selection experiments ([61] and references therein). Our genome-wide study details the genomic location and magnitude of this variation and depicts the first high-resolution polymorphic landscape of CO rates in D. melanogaster. Several genomic regions have low rates in all crosses, thus representing monomorphic (or high-frequency) coldspots for CO in D. melanogaster. Other regions assigned as peaks of CO rates based on combined maps, however, are strongly influenced by polymorphic hotspots at low frequency in our sample. In fact, most regions with excess variance in CO rates among crosses are associated with low-frequency hotspots rather than low-frequency coldspots suggesting that hotspots are transient (short-lived) features within D. melanogaster populations. Our results thus indicate that CO rates based on multiple crosses and genotypes are needed to obtain a representative depiction of a “species” recombination landscape. Additionally, the low frequency of the hotpots will strongly influence measures of recombination based on the arithmetic mean of all maps, suggesting higher rates than measures such the harmonic mean or median (see Figure S3 for a comparison between mean and median CO values). Notably, we observe genomic regions with very low (or zero) median CO rates while the sample mean would suggest average rates. Gene conversion maps in D. melanogaster We have detected a total of 74,453 GC events. Nevertheless, GC tracts that lay between adjacent markers are expected to be missed. Moreover, this underestimation is probably variable across the genome due to differences in SNP and marker density. Therefore, we expanded a maximum likelihood algorithm [62] that was proposed for estimating the length of GC tracts (LGC ) to simultaneously estimate LGC and the rate of GC initiation (γ), and be applicable to any region of arbitrary marker distribution and density (see Materials and Methods for details). Our genome-wide estimates of γ and average LGC are 1.25×10−7/bp/female meiosis and 518 bp, respectively. The study of each chromosome arm separately (Figure 4) shows that arms with evidence of CO (2L, 2R, 3L, 3R and X) have similar estimates of γ (1.13–1.49×10−7/bp/female meiosis) and LGC (456–632 bp). Notably, we observe several GC events in the small achiasmatic chromosome fourth where CO is completely absent. Our estimates of γ and LGC for the fourth chromosome are 0.46×10−7/bp/female meiosis and 1062 bp, respectively. 10.1371/journal.pgen.1002905.g004 Figure 4 Gene conversion estimates in D. melanogaster. Joint maximum-likelihood estimates (MLE) of the rate of gene conversion initiation (γ) and mean gene conversion tract length (L GC) in D. melanogaster. γ units are per bp and female meiosis, and L GC in bp. Red/yellow contours represent 95 confidence intervals for γ and L GC for each chromosome arm independently. The blue dot represents the genome average for γ and L GC based on a total of 74,453 observed GC events. The rosy locus in D. melanogaster is one of the best characterized in higher metazoa for intragenic recombination [63], [64]. These studies showed that GC events are more frequent than CO, with four non-crossover associated GC events to each CO [63]–[65]. In terms of absolute rate, the recovery of intragenic CO events at rosy reveals c∼3.0×10−8/bp/female meiosis [66] thus predicting γ∼1.2×10−7/bp/female meiosis at this locus. When we focus on the 100-kb genomic region encompassing the rosy locus our estimate of γ is 1.17×10−7/bp/female meiosis. At a whole-genome scale, our data suggest a γ (1.25×10−7/bp/female meiosis) and a ratio GC∶CO (∼83% of events result in GC) close to, albeit higher than, the estimates at rosy. A major difference between our results and those from the rosy locus however is the mean length of gene conversion tracts, with our average estimate of LGC (518 bp) significantly exceeding the estimate of 352 bp at rosy [62]. Another approach to estimate GC∶CO ratios is based on using an antibody to γ-His2Av as a molecular marker for DSB formation [67] and monitoring the number of γ-His2Av foci in DSB repair-defective mutants [68]. The number of estimated DSB in D. melanogaster using this methodology is up to 24.2 per genome [68], suggesting that 76.2% of all DSB are resolved as GC when we use the observed number of CO events per female meiosis from our data. The moderately higher fraction of GC observed in our study could be explained by differences among the strains used, if not all DSBs (or DSB-repair pathways) are marked by γ-His2Av staining [68] or if the DSB-repair defective mutants allowed for residual repair thus making some DSBs difficult to detect. Of particular interest will be future research focused on trying to localize experimentally DSBs on the fourth chromosome and other genomic regions where CO is absent but GC is detected. The analysis of the distribution of γ along chromosomes at the 100-kb scale reveals a more uniform distribution than that of CO (c) rates, with no reduction near telomeres or centromeres (Figure 5). More than 80% of 100-kb windows show γ within a 2-fold range, a percentage that contrasts with the distribution of CO where only 26.3% of 100-kb windows along chromosomes show c within a 2-fold range of the chromosome average. To test specifically whether the distribution of CO events is more variable across the genome that either GC or the combination of GC and CO events (i.e., number of DSBs), we estimated the coefficient of variation (CV) along chromosomes for each of the three parameters for different window sizes and chromosome arms. In all cases (window size and chromosome arm), the CV for CO is much greater (more than 2-fold) than that for either GC or DSBs (CO+GC), while the CV for DSBs is only marginally greater than that for GC: for 100-kb windows, the average CV per chromosome arm for CO, GC and DSBs is 0.90, 0.37 and 0.38, respectively. Nevertheless, we can also rule out the possibility that the distribution of GC events or DSBs are completely random, with significant heterogeneity along each chromosome (P 0.20) or introns (R = −0.041, P>0.16). An equivalent lack of association is observed when G+C nucleotide composition is compared to c (P>0.25 for both intergenic sequences and introns). We find therefore no evidence of gene conversion bias favoring G and C nucleotides in D. melanogaster based on nucleotide composition. The causes for some of the previous results that inferred gene conversion bias towards G and C nucleotides in Drosophila may be multiple and include the use of sparse CO maps as well as incomplete genome annotation. Because gene density in D. melanogaster is higher in regions with non-reduced CO [13], [79], the many recently annotated transcribed regions and G+C rich exons [31], [80], [81] may have been previously analyzed as neutral sequences, particularly in these genomic regions with non-reduced CO. The motifs of recombination in Drosophila To discover DNA motifs associated with recombination events (CO or GC), we focused on 1,909 CO and 3,701 GC events delimited by 500 bp or less (CO500 and GC500, respectively). Our D. melanogaster data reveal many motifs significantly enriched in sequences surrounding recombination events (18 and 10 motifs for CO and GC, respectively) (Figure 6 and Figure 7). Individually, the motifs surrounding CO events (MCO) are present in 6.8 to 43.2% of CO500 sequences, while motifs surrounding GC events (MGC) are present in 7.8 to 27.6% of GC500 sequences. Note that 97.7% of all CO500 sequences contain at least one MCO motif and 85.0% of GC500 sequences contain one or more MGC motif (Figure S4). 10.1371/journal.pgen.1002905.g006 Figure 6 Top DNA motifs found enriched in sequences encompassing CO events. We focused on 1,909 CO events delimited by 500 bp or less (CO500 sequences). Only motifs with E-vale 100 when c≤0.1 cM/Mb, consistent with population genetic estimates of γ/c at telomeric regions of the X chromosome of D. melanogaster [94]. 10.1371/journal.pgen.1002905.g008 Figure 8 Relationship between crossing over (c) and gene conversion (γ) rates across the D. melanogaster genome. c and γ based on 100-kb adjacent windows with windows grouped into 6 categories of equal number according to c [CO1, CO2, .., CO6 indicating increasing rates of c] The average c (cM/Mb/female meiosis) values for the six categories is: 0.078 (CO1), 0.727 (CO2), 1.439 (CO3), 2.294 (CO4), 3.299 (CO5) and 5.964 (CO6). Blue columns show results when whole chromosomes are analyzed. Orange columns show results after removing centromeric and telomeric regions with visibly reduced CO rates. There is a significant negative relationship between c and γ using independent (non-overlapping) 100-kb windows: Spearman's R = −0.1246 (P = 1.6×10−5) for whole chromosomes and R = −0.1191 (P = 1.2×10−4) after removing telomeric/centromeric regions. Several hypotheses can be put forward to explain the different landscapes for CO and GC across the D. melanogaster genome. Divergent DSB repair pathways have been proposed in Drosophila as in yeast [100], [101], with a synthesis-dependent strand annealing (SDSA) pathway that is associated only with GC events, while the resolution of the double Holliday junction (DHJ) can generate either CO or GC (Figure S1). The detection of GC events in the fourth chromosome strongly suggests the action of SDSA, at least for a chromosome completely lacking CO, and indicates that SDSA may be acting across the whole genome. The observation that γ increases when c is low even after removing the fourth chromosome and telomeric/centromeric regions (see Figure 8) argues against the option of very large chromosomal domains with DSBs that are repaired by a different pathway. This same negative relationship between γ and c, together with the observed overlap in motifs associated with CO and GC events (see above), suggests a shared origin, and thus indicates that the disparity of landscapes for GC and CO rates across the D. melanogaster genome may be influenced by a difference in the relative usage of DSB repair pathways (eg., DHJ versus SDSA) or by a variable bias in the repair decision when DHJ intermediates are resolved to form either CO or GC products. In yeast, the presence of mismatches interferes with the formation and/or extension of heteroduplex intermediates during mitotic and meiotic DSB repair [102], and sequence divergence inhibits mitotic COs to a greater extent than GCs [103]. A recent genome-wide analysis of meiotic recombination intermediates in yeast, however, suggests that mismatch repair increases the ratio CO∶GC [104]. In agreement, analyses at the rosy locus in D. melanogaster show a small increase in the ratio CO∶GC in the presence of sequence polymorphisms (27 CO and 5 GC) when compared to a case where polymorphisms are virtually absent (23 CO and 8 GC events) [66]. If this tendency is confirmed across the Drosophila genome and if the differences in mismatch presence across the genome are of sufficient magnitude to alter either DHJ/SDSA relative use or the resolution of DHJ into CO or GC, then genomic regions with reduced heterozygosity could favor a DSB repair favoring GC over CO events [105]. At a whole-genome level, nucleotide differentiation between parental strains (ranging between 0.005 and 0.007 for total pairwise differences per bp) shows no association with overall γ/c, c or γ (P>0.4 in all cases). To test the possible influence of mismatch presence across the genome we investigated the correlation between levels of total nucleotide polymorphism (π) and the γ/c ratio based on adjacent 100-kb regions (Figure 9; see Materials and Methods). Congruent with the hypothesis that the choice to repair DBS into either GC or CO is heterozygosity-dependent, we observe a strong negative correlation between total π and γ/c across the whole genome (R = −0.56, P 0.40). The differential distribution of GC and CO when looking at genic and intergenic sequences is consistent with the heterozygosity-dependent GC∶CO repair of DSB proposed above, given that intergenic sequences have higher levels of heterozygosity than genic sequences. Overall, our data suggest a higher probability of DSBs within annotated transcriptional units. 10.1371/journal.pgen.1002905.g010 Figure 10 Recombination events within genic sequences in the D. melanogaster genome. Analyses based on 1,909 and 3,701 CO and GC events delimited by 500 bp or less (CO500 and GC500). (A) Frequency of recombination events (CO or GC) within genic sequences. Probability [P (Freq. Observed 2.5 kb) shows the median GC500 position shifting significantly relative to the TSS (from +556 in short transcripts to +3588 in long transcripts; Mann-Whitney test U = 51,192, P 100-kb) and small- (i.e., single-gene) scales, among crosses and for GC and CO. Each of these levels is expected to play a role in shaping evolutionary patterns. We propose that these overlapping, non-independent and dynamic landscapes of recombination should be included in a new generation of population genetic models of selection and recombination. Materials and Methods Drosophila melanogaster strains We generated eight crosses using 12 highly inbred strains of D. melanogaster. Ten of these strains were recently generated by the laboratory of Trudy Mackay from a natural population collected in Raleigh (NC, USA) after 20 generations of full sib mating (RAL strains). Importantly, these strains have maintained substantial genetic/phenotypic variation between lines [141], [142]. The RAL strains used in this project are: RAL-208, -301, -306, -375, -380, - 391, -514, -712, -786 and -820. We also used two non-US strains, collected in Madagascar (MDG) and Papua New Guinea (PNG) obtained from the Bloomington Drosophila Stock Center (Indiana University). Table 1 indicates the eight crosses under study. Our stocks of RAL-786 and -820 carried fixed inversions for chromosome arm 3R and therefore the analyses of recombination rates for this chromosome arm excluded these two crosses. 10.1371/journal.pgen.1002905.t001 Table 1 Cross, parental strains, and number of female meioses. Cross # Parental Strain 1 Parental Strain 2 Number of female meioses 1 PNG MDG 762.5 2 RAL-208 RAL-375 1522 3 RAL-306 RAL-391 493.5 4 RAL-375 RAL-514 720 5 RAL-208 PNG 752.5 6 RAL-301 RAL-375 627 7 RAL-712 RAL-786 499.5 8 RAL-380 RAL-820 483 The genomes of the RAL strains have been sequenced [The Drosophila Population Genomics Project (DPGP [17]), and The Drosophila Genetic reference Panel (DGRP [142]). Nevertheless, and for all strains including RALs, we obtained Illumina sequence reads and generated genomic sequences of the strains used in our laboratory for crosses to get an accurate (current) description of SNPs and small indels for all parental strains, including the possible presence of heterozygous sites. Crosses and generation of Recombinant Advanced Intercross Lines (RAILs) A number of factors can influence recombination in D. melanogaster including female age, temperature, number of matings or food [57]–[60]. To reduce the effects of these factors, all crosses and strains were maintained at the same temperature (23C) under a 12 hr. light/dark cycle, with constant fly density in half pint bottles on standard corn-meal media. Recombination is higher for young (1–2 days) and old (>12 days) females [59]. Therefore all crosses were carried out with flies of equivalent age, with 6 hour old virgin flies allowed to mate and lay eggs for a period of 36–48 hours, after which the parents were discarded and the offspring were allowed to emerge. For all crosses, parental strains were crossed in both directions to average out possible maternal effects on recombination rates. Standard approaches to assess rates of crossing over often cross two inbred strains and genotype progeny from F2 backcrosses. To maximize the number of recombination events per fly genotyped, we generated recombinant advanced intercross lines (RAIL). To generate RAILs by sibling mating and minimize homozygosity, we crossed five groups of forty F1 flies, twenty males and twenty females, randomly collected from the previous generation and put together in new bottles. The F1 flies were also allowed to mate for 1.5–2 days after which they were discarded and F2 offspring were allowed to emerge. To generate the next generation, flies from all bottles were randomly mixed to create the next groups of twenty males and twenty females. For all crosses, the number of bottles was increased progressively, starting with five up to ten, to reduce the effect of random genetic drift that could generate homozygous combinations for recombination events thus underestimating recombination rates. Our crosses varied in terms of sib-mating generations, ranging between 1 (standard approach with a single meiosis in heterozygous D. melanogaster females) and 10 generations. To estimate the number of female meioses associated with each cross we modeled our crossing design with simulations mimicking the corresponding mating process, including the number of individuals (males and females) and bottles per generation (see Table 1 for the number of meioses corresponding to each cross). The use of the ‘equivalent meioses’ estimated with this approach generated a genetic map length for D. melanogaster of 287.3 cM. This value closely matches classical measures (282 cM) in this species [26] thus indicating that our approach to estimate the number of meiosis captures appropriately the map expansion with number of generations and the crossing design. Table 2 shows the genetic map length for each cross and chromosome arm. 10.1371/journal.pgen.1002905.t002 Table 2 Genetic map length per chromosome arm and cross. Cross # X 2L 2R 3L 3R 1 71.8 49.8 45.8 54.9 47.6 2 66.2 55.8 53.8 52.7 60.8 3 56.4 64.9 60.0 49.9 56.7 4 63.4 62.5 51.4 51.8 61.3 5 67.5 55.6 49.0 55.9 56.2 6 66.6 52.4 55.0 50.3 66.5 7 73.0 55.9 58.6 53.4 * 8 57.7 57.5 59.3 53.9 * * Fly stocks RAL-786 and -820 carry fixed inversions for chromosome arm 3R. Finally, female D. melanogaster RAILs were crossed to D. simulans males to obtain hybrid females for genotyping. We originally used D. simulans males from strain w501, but we finally decided to use D. simulans males from the strain Florida City due to the much higher fraction of hybrids produced. D. melanogaster×D. simulans crosses were conducted using D. simulans males aged for 3–4 days before combining them for interspecific mating. After mating, D. simulans males were discarded and D. melanogaster females were individualized and maintained in vials to lay eggs. For each cross between two parental strains, we froze (−20°C) the hybrid offspring of more than 350 mated D. melanogaster RAIL females. To ensure replicates and resequencing if needed, we froze 6 hybrid offspring females from each D. melanogaster mated females. Figure S5 shows a schematic representation of the crossing methodology. Note that hybrids are used for genotyping only and the meiotic products from these hybrids are not analyzed. In total, we sequenced 2,829 hybrid females (with 2,829 single-fly Illumina libraries, see below), corresponding to 5,860 meioses. DNA extraction We extracted DNA from single hybrid females using Tissue Lyser LT (Qiagen, Valencia, CA) followed by a modified Qiagen DNeasy Blood & Tissue Kit (Qiagen, Valencia, CA) protocol. 90% of the eluted DNA was used for preparing a library for each individual fly. The rest of the sample was frozen and kept for posterior validation of markers, crossing over (CO) or gene conversion (GC) events if needed. The DNA of each fly was fragmented using a Bioruptor UCD-200 (Diagenode, Denville, NJ) with 0.5 ml tubes and settings that maximize the concentration of sheared DNA at 300 bp. Most of the eluted fragmented DNA was used for library preparation while the remaining (15%) of the sample was used for analysis of fragmented DNA for each individual fly by measuring the relative amount of a segment of the gene rp49 using Real-time PCR analysis (Roche LightCycler 480; Roche, IN, USA). This first quantification of single fly DNA served to confirm successful DNA extraction to be used for Illumina libraries. A second relative quantification of DNA was also performed after ligation of Illumina adapters (see below) following an equivalent protocol and used to normalize the amount of library DNA from each fly for multiplex sequencing. Illumina library preparation Single fly Illumina libraries of sheared DNA were prepared using NEBNext reagents and protocols (New England Biolabs, MA, USA) scaled-down appropriately for small amounts of DNA (detailed protocol available from J.M.C. upon request) and using indexed Illumina adapters. When sequencing multiplexed samples (using different tags) there is the possibility of mistakenly assigning reads to the wrong sample due to sequencing errors in the tag sequence. The likelihood of miss-assignment rapidly increases with, 1) shorter tags, 2) with the use of sequence reads not perfectly matching tag sequences and, 3) when tag sequences are only one or two nucleotides away from other tag-sequences. To reduce this potential error as much as possible our custom 7-nucleotide tags (7 nt plus the required ‘T’ for Illumina primer ligation) were designed to be a minimum of 4 nucleotide changes away from any other tag. Additionally, we only used Illumina reads with initial sequences completely identical to the tags used. This conservative approach is needed to prevent read miss-assignment in marker-based, light sequencing studies. Our 24 custom-designed 7-nucleotides tags are described in Table S1. The use of 24 indexed Illumina adapters allowed us to multiplex 24 single-fly libraries in a single Illumina lane to be later separated computationally. The PCR enriched libraries were validated by running an aliquot on a standard agarose gel. Final concentration quantification was obtained with Quant-iT PicoGreen dsDNA Reagent and Kits (Invitrogen, CA, USA) on a Turner BioSystems TBS-380 Fluorometer. Twenty four equimolar PCR-enriched libraries were pooled together to obtain a final set of multiplexed single-fly libraries that would be sequenced in a single Illumina lane. Sequencing was mostly performed using the Illumina Genome Analyzer IIx and Illumina HiSeq 2000 instruments at the Iowa State University DNA Facility with additional plates sequenced at the High Throughput Sequencing Facility of the University of North Carolina at Chapel Hill using the Illumina Genome Analyzer IIx. Genomic analyses and bioinformatic pipeline Generation and annotation of parental reference sequences We generated reference genomic sequences for all D. melanogaster parental strains and for the D. simulans Florida City strain that we used for our crosses. We obtained 75- and 100-nt Illumina single reads in excess of 25-fold coverage for the D. melanogaster RAL strains and in excess of 40-fold coverage for D. melanogaster MDG and PNG and for D. simulans Florida City strains. To generate RAL reference parental sequences we added the sequencing reads from the NCBI short read archive (SRA) study SRP000694 [142] to our 75-nt or 100-nt reads. Filtering of reads, mapping and generation of consensus reference sequences for parental strains was carried out using the FASTX-toolkit (http://hannonlab.cshl.edu/fastx_toolkit/), BWA [143], SAMtools v1.4 [144] and a collection of custom PERL scripts using as reference the D. melanogaster genome sequence (r.5.30; http://flybase.org). Contrary to standard approaches to generating consensus sequences based on SNP calling, we generated parental reference sequences specifically intended for our mapping purposes. We focused on taking into account heterozygous sites in parental strains that could miss-assign the origin of individual reads as well as annotate as unreliable sites those sites with limited representation (coverage). Two distinct issues associated with heterozygosity within strains were detected. First, residual heterozygosity (present when the lines were originally sequenced, ca. 2008–2009) and maintained in the strain that was used in our laboratory for crosses. Second, sites showing a different high-frequency/monomorphic variant in our laboratory relative to when they were originally sequenced. We annotated (marked) each potential heterozygous site in the reference sequence of parental strains as ambiguous sites using the appropriate IUPAC ambiguity code using a permissive approach. We used full (raw) pileup files and conservatively considered as heterozygous site any site with a second (non-major) nucleotide at a frequency higher than 5% regardless of consensus and SNP quality. For instance, if the sequencing of a parental strain of D. melanogaster generates 12 reads exhibiting an ‘A’ and 1 read exhibiting a ‘G’ at a particular nucleotide position, the reference will be marked as ‘R’ even if consensus and SNP qualities are 60 and 0, respectively. We assigned ‘N’ to all nucleotide positions with coverage less that 7 regardless of consensus quality because of the lack of information on their heterozygous nature. We also assigned ‘N’ to positions with more than 2 nucleotides. This approach is conservative when used for marker assignment because the mapping protocol (see below) will remove heterozygous sites from the list of informative sites/markers while also introducing a “trapping” step for Illumina sequencing errors that may be not fully random. Finally we introduced insertions and deletions for each parental reference sequence based on raw pileup files. Mapping of reads and generation of D. melanogaster recombinant haplotypes Sequences were first pre-processed and only reads with sequences exact to one of tags were used for posterior filtering and mapping. FASTQ reads were quality filtered and 3′ trimmed, retaining reads with at least 80% percent of bases above quality score of 30, 3′ trimmed with minimum quality score of 12 and a minimum of 40 bases in length. Any read with one or more ‘N’ was also discarded. This conservative filtering approach removed an average of 22% of reads (between 15 and 35% for different lanes and Illumina platforms). We then eliminated all reads with possible D. simulans Florida City origin, either truly originating from the D. simulans chromosomes or with D. melanogaster origin but similar to a D. simulans sequence. We used MOSAIK assembler (http://bioinformatics.bc.edu/marthlab/Mosaik) to map reads to our marked D. simulans Florida City reference sequence. Contrary to other aligners, MOSAIK can take full advantage of the set of IUPAC ambiguity codes during alignment and for our purposes this allows the mapping and removal of reads when represent a sequence matching a minor allele within a strain. Moreover, MOSAIK was used to map reads to our marked D. simulans Florida City sequences allowing 4 nucleotide differences and gaps to remove D. simulans -like reads even with sequencing errors. We further eliminated D. simulans -like sequences by mapping remaining reads to all available D. simulans genomes and large contig sequences [Drosophila Population Genomics Project; DPGP, http://www.dpgp.org/] using the program BWA and allowing 3% mismatches. The additional D. simulans sequences were obtained from the DPGP website http://www.dpgp.org/ and included the genomes of six D. simulans strains [w501, C167, MD106, MD199, NC48 and sim4+6; [17]] as well as contigs not mapped to chromosomal locations. After removing reads potentially from D. simulans we wanted to obtain a set of reads that mapped to one parental strain and not to the other (informative reads). We first generated a set of reads that mapped to at least one of the parental reference sequences with zero mismatches and no indels. At this point we split the analyses into the different chromosome arms. To obtain informative reads for a chromosome we removed all reads that mapped to our marked sequences from any other chromosome arm in D. melanogaster, using MOSAIK to map to our marked reference sequences (the strain used in the cross as well as from any other sequenced parental strain) and using BWA to map to the D. melanogaster reference genome. We then obtained the set of reads that uniquely map to only one D. melanogaster parental strain with zero mismatches to the marked reference sequence of the chromosome arm under study in one parental strain but not in the other, and vice versa, using MOSAIK. Reads that could be miss-assigned due to residual heterozygosity or systematic Illumina errors would be removed in this step. Before considering the remaining reads as informative markers we incorporated additional filtering steps. We removed all reads that would differentially map to one parental reference sequence and not to the other if one of the reference sequences corresponding to this read contained one or more ‘N’s. Although the dominant error type in Illumina sequencing is substitutions (as opposed to indels), indels are more difficult to deal with and to incorporate uncertainty and therefore we removed reads that would differentially map to one parental reference sequence and not to the other if the parental reference sequences at this region differ by an indel. We also removed reads that showed nucleotide differences between parental strains at the 5′ or 3′-end of the read. Finally, we considered informative reads only those that distinguish between parental reference sequences with a maximum of 3 single-nucleotide, non-consecutive differences. The informative reads from each genotyped hybrid female were used to generate D. melanogaster (possibly recombinant) haplotypes for each chromosome arm. Detection of CO and GC events We mapped CO and GC events directly to each individual D. melanogaster haplotype (from a single RAIL hybrid) and not based on the combined analysis of all D. melanogaster haplotypes for a given chromosomal region. That is, most chromosomes only show 1 to 4 (in the case of several generations of RAIL) CO events. Due to the elevated density of markers each CO is supported by numerous (hundreds and often thousands) contiguous markers at either side and therefore we expect to have detected all COs. GC events on the other hand are supported by single or a few adjacent markers that do not extend over long stretches of DNA (i.e., much shorter than 25 kb). In principle, double CO in a single meiosis (or two independent COs in different meiosis) could be mistaken for long GC if they were very close to each other. GC events are assumed to be short, often shorter than 500 bp and be exceptionally rare above 10–15 kb [62]. We analyzed marker maps to locate CO and GC events along single chromosomes, using a cut-off for maximum tract length for GC (LGC ) of 15 kb. Several lines of evidence suggest that this approach classifies correctly CO and GC events based on our experimental design. First, equivalent maps for CO and GC were obtained when applying a cut-off of 25 kb, suggesting that at 15 kb we are classifying as GC most if not all detectable GC events and that, when two or more CO events occur in the same chromosome in our RAILs, these COs are separated by more than 25 kb. Second, crosses involving several generations of RAIL show equivalent number of CO per chromosome per female meiosis to crosses based on a single meiosis. Finally, simulations of CO distribution along chromosomes following the mating protocol used to generate RAILs, with a cut-off of 15 kb to assign GC and the conservative assumption of no CO interference shows a maximum erroneous assignment of 0.16% and 1.4% assuming random distribution or the observed distribution of CO, respectively. We detected a total of 32,511 CO events and CO maps for each cross and chromosome arm were generated by directly combining the observed COs from all individual haplotypes and tabulated along each chromosome in terms of c [centimorgans (cM) per megabase (Mb) per female meiosis]. Estimates of gene conversion initiation rates (γ) and GC tract length (LGC ) Our study revealed a total of 74,453 GC events. Nevertheless, a fraction of GC events are expected to be missed due to GC tracts that lay between adjacent markers. Moreover, this underestimation is predicted to be variable across the genome due to differences in SNP and marker density. Our data consists of a great many independent GC events distributed across different haplotypes for a given chromosome, each GC event likely defined by different SNPs and a different distance from adjacent SNPs. The nature of this dataset differs from previous population genetic studies of gene conversion [94], [145] as well as from experimental studies that based their results on genetic crosses that directly detected presence/absence of GC events using a limited number of informative markers and/or focused on a specific genomic region [62], [112]. SNPs not involved in GC events, each separated by a different distance from adjacent SNPs, are also informative about the rate of GC initiation (γ) and length of GC tracts (LGC ). We therefore expanded a previous maximum likelihood algorithm [62] to estimate simultaneously γ and LGC and to be applicable to any region of arbitrary size with variable density SNP/marker data that takes into account both observed GC events and markers not involved in GC events. Each observed, unselected, GC tract will be treated as a different event defined by the outmost markers (left and right nucleotides) of the observed tract that describe the minimum true tract length (L min; L min≥1). We also know that a tract has a left end and a right end delimited by the nearest left/right flanking markers not involved in the tract, with mgc indicating the average number of nucleotides between the observed tract and the left and right flanking markers. The maximum tract length (L max) is then L min+2(mgc ). Following Hilliker et al. (1994) [62], gene conversion tract lengths can be described by a geometric distribution that assumes independence of each nucleotide-adding step with a probability φ. The probability of a GC tract of length n nucleotides can be described by with the mean tract length The likelihood of an observed GC event that encompasses the observed tract is then Markers not involved in GC tracts either due to no GC event or because GC tracts initiate and terminate between two 2 markers are also informative. These markers are separated by m nucleotides and we preserve the possibility that m differs from mgc . Let 1- φn denote the probability of a GC tract shorter than n nucleotides. Then For a complete dataset with k GC events and t markers not being involved in GC events, the total Likelihood of the data is or its log for convenience. Finally we can obtain numerically the Maximum Likelihood Estimate (MLE) of γ and LGC using the log-likelihood function for our dataset(s). We have applied this approach to estimate γ and length LGC for the whole genome as well as for each and along chromosome arms. Validation In silico False Discovery Rate (FDR) analysis Although we have strived for designing a protocol that includes a hefty number of filters and mapping controls, we anticipate a non-zero rate of misplacing reads given the massive number of reads obtained per cross. We estimated our false discovery rate (FDR) for CO and GC events by generating random collections of Illumina reads when there is no expectation of detecting any recombination (CO or GC) event. We applied the same bioinformatic pipeline used to identify informative markers, generate D. melanogaster haplotypes and ultimately identify CO and GC events and estimate c and γ. We investigated the efficacy of our filtering/mapping protocol by generating collections of reads with 50% of reads from a single parental D. melanogaster (eg, RAL-208) and 50% of reads from the D. simulans strain used in all crosses (Florida City) to closely represent the reads from a single hybrid female fly when there is no expectation for any CO or GC event. The reads used for this study were obtained from our Illumina sequencing effort of parental D. melanogaster and the D. simulans strains used in this study (see above) and were used with no a priori knowledge of their sequence and mapping quality, Each in silico library is, on average, equivalent to individual hybrid libraries in terms of number of reads with the only difference that we removed the first 8 nucleotides of each read from the parental lines (equivalent to the removal of the 5′ (7 nt+‘T’) tag in our multiplexed hybrid reads). This approach to estimate FDR takes into account possible limitations in the filtering and mapping algorithms and protocols, Illumina sequencing errors (random and non-random), the effects of non-complete or inaccurate reference sequences and the bioinformatic pipeline. We generated 400 in silico random library collections (the average number of libraries per cross), applied the same bioinformatic pipeline and parameters used for the filtering and mapping of reads from our crosses and estimated CO and GC rates. Because the expectation is zero for both CO and GC we can compare these rates to those from actual crosses to obtain a suitable FDR. Our results show that no CO event would be inferred when using only one D. melanogaster parental strain and D.simulans (zero events in all 400 in silico libraries compared to the more than 2,000 detected per cross). GC events are however detected. Overall, we can infer that 4.1% of our inferred GC events can be explained by miss-assigned reads and that most of these erroneously mapped reads are from the D. melanogaster strain, not from the parental D.simulans. This FDR varies among chromosomes, highest and lowest for the 3R (6.2%) and X (1.9%) chromosome arms, respectively. Zero GC events (in 400 in silico libraries) were inferred in the small chromosome 4. Worthy of note, preliminary FDR studies based on more straightforward filtering/mapping protocols revealed higher percentages of miss-assigned reads that would overestimate the number of GC events. These results guided us to include additional steps to our bioinformatic pipeline. The additional steps included in our final protocols (see above) consisted in, 1) increased filtering stringency of reads before mapping, 2) use of additional D. simulans sequences and contigs (additional to the Florida City reference) for ‘trapping’ purposes of D. simulans -like sequences, 3) removal of reads mapping to chromosomes other than the one under study (with permissive mapping allowing up to 10% mismatch), and 4) removal of reads that differentiate parental sequences if these sequences differ by an indel. Experimental PCR/sequencing of gene conversion events When we extracted DNA from single hybrid females we froze ∼10% of the total DNA for posterior validation of markers, crossing over (CO) or gene conversion (GC) events if needed. We designed primers, PCR amplified and sequenced the genomic regions surrounding 31 GC events detected based on our bioinformatic pipeline. These 31 GC events were chosen to represent GC events assessed based on a single informative marker, in the fourth chromosome or with long GC tract (>2 kb). 30 out of these 31 showed SNPs patterns confirming the presence of a GC event. Five of these confirmed GC events were located in the fourth chromosome. Overall the GC events ranged in tract length, from 171 bp to a maximum of 9,585, the latter located in the X chromosome (ca. position 2.6 Mb). Estimates of recombination heterogeneity along chromosomes Heterogeneity in recombination events (CO or GC) along chromosome arms was investigated as the probability of observing equal or greater coefficient of variation (CV) among non-overlapping regions. The size of the genomic regions ranged from 100-kb up to half the chromosome length (10 Mb) in 100-kb increments. Probabilities were obtained based on 10,000 replicates per window size and chromosome, and assuming a random distribution of events. Heterogeneity was investigated for whole chromosomes and after removing centromeric and telomeric regions with visibly reduced CO rates (see below). Estimates of variation in recombination maps among crosses We studied intra-specific variation in CO rates by estimating the variance to mean ratio, or Index of Dispersion (R), and tested whether the observed variance in CO events among crosses can be explained by a Poisson process. This approach assumes that for a given chromosomal region the number of observed CO events in different crosses is a set of independent and identically distributed random variables. In the case of Poisson distributed data, the expectation is R∼1. To focus our study on variation in the distribution of CO rates and take into account genome-wide differences between crosses and the different number of meiosis analyzed, we used weighting factors (w) that include the total number of CO per chromosome in a particular cross as ‘lineage effects’. The weighting factor of cross i for a given chromosome (wi ) is then with mi representing the total number of CO events in a chromosome in cross i, and n the number of crosses analyzed (in our case n = 8). Following Gillespie (1989) [146], we can obtain the index of dispersion for CO events (R CO) for region r based on with where , and indicate the mean number of recombination events at region r, the number of CO events in region r for cross i, and the unbiased estimator of the variance at region r, respectively. Assuming a Poisson distribution of CO events, R CO values significantly greater than 1 would indicate an excess variance (overdispersion) of CO events among crosses for the region under study. We estimated the statistical significance of R CO values being greater than expected using simulations that assume a Poisson-distributed total number of CO events that are randomly distributed among crosses (once corrected for lineage effects) and along chromosomes. Note that the variance due to binomial sampling is greater than that for Poisson, thus the expectation for binomial distributed data is R<1, making our approach conservative when assessing overdispersion of CO events among crosses. Probabilities were obtained based on 100,000 independent replicates. R CO along chromosomes was investigated for non-overlapping regions with sizes ranging from 250 kb to 1 Mb. Local distribution of CO and GC events Because we focused on CO and GC events delimited by 500 bp or less (CO500 and GC500), heterogeneity in the density of informative markers across the genome could bias our expected, null, distribution of CO500 and GC500 events. Indeed, informative markers are 1.4% more likely to be located in intergenic regions than within genic regions. We then used this “marker-density corrected” approach to obtain the expected frequency of CO500 and GC500 recombination events within genic sequences (0.607 instead of 0.637 obtained when assuming random distribution of markers). DNA motifs We used MEME (version 4.6.1) [147], a software for discovering motifs in sets of DNA sequences, to search for motifs enriched in sequences around CO and GC events. We focused on 1,909 CO and 3,701 GC events delimited by 500 bp or less (CO500 and GC500). When the length of the sequence delimited by adjacent markers was less than 500 bp we extended the sequence both at the 5′ and 3′ ends up to a total of 500 bp. We searched for motifs with nucleotide size ranging between 5 and 20 nucleotides and used as a background sequence model one that takes into account both nucleotide and dinucleotide composition (first-order Markov model), the latter important to capture the observed frequency of dinucleotide repeats across the genome. MEME was applied to search for motifs on both the given DNA strand and the reverse complement strand and allowing any number of repetitions per sequence using the complete 500 bp sequences. We also used MEME to generate sequence LOGOS for each discovered motif. Figure 6 and Figure 7 show the 18 and 10 motifs enriched in CO500 and GC500, respectively, with corrected probabilities (E-values) smaller than 10−10. E-values represent the expected number of sequences in a random database of the same size that would match the motifs as well as the sequence does. To obtain comparable E-values for CO and GC events we applied MEME to a random set of 1,000 CO500 and 1,000 GC500 sequences. We also investigated the commonness of the motifs with E-value<10−10 within CO500 and GC500 sequences (Figure S4). Centromeric and telomeric regions with reduced CO rates For the sake of analyzing the distribution of recombination rates along chromosomes and their possible causes or consequences we designated centromeric and telomeric regions with visibly reduced CO (see Figure 1) as possibly influencing some of our analyses (eg, heterogeneity along chromosomes, CO vs GC rates, local distribution of CO and GC events, etc.). Thus most analyses were performed using whole chromosomes as well as after removing the fourth chromosome (which has no detectable CO) and these long, broadly defined peripheral regions (see Table 3). 10.1371/journal.pgen.1002905.t003 Table 3 Genomic regions investigated after removing centromeric and telomeric regions. Chromosome arm Genomic region* X 2.3 — 20.8 Mb 2L 0.5 — 17.4 Mb 2R 5.2 — 20.8 Mb 3L 0.7 — 19.9 Mb 3R 9.7 — 26.9 Mb * Genomic location based on the nucleotide position in the D. melanogaster reference (r.5.3) genome sequence. Nucleotide composition across the D. melanogaster genome We investigated the possible relationship between recombination rates (c or γ) and nucleotide composition across the D. melanogaster genome to test the hypothesis of a bias in heteroduplex repair with AT/GC heterozygotes being preferentially repaired towards G and C nucleotides (aka, gene conversion bias). Our analyses of nucleotide composition in the D. melanogaster genome were based on the D. melanogaster reference sequence (release 5.3, January 2011). We annotated all gene models (not only protein coding genes), transposable elements and repetitive sequences onto the reference sequence and marked any overlapping annotation. To test for gene conversion bias in D. melanogaster, we focused on G+C nucleotide composition at intergenic and intronic sites using adjacent 100-kb regions. Intergenic sites were defined as those sites between any annotated gene model (thus also excluding annotated UTRs), transposable elements or repetitive sequences. Intronic sites were defined as those that are only associated with an intron definition in protein coding genes (i.e., sites that are either ‘intron’ or ‘exon’ depending on alternatively spliced forms were not considered) and do not overlap with any other annotation. Nucleotide polymorphism across the D. melanogaster genome For the sake of obtaining information on polymorphism levels across the genome we used DGRP sequenced strains [142], all from the same natural population (Raleigh, NC, USA) than the RAL strains used in this study, and added our sequencing data from the parental RAL strains. We also included the sequenced strains MDG and PNG in the overall analysis of polymorphism. Among all DGRP strains, we chose the 34 strains with 454-Roche as well as Illumina sequencing reads to maximize SNP accuracy. DGRP sequencing reads were obtained from NCBI short read archive (SRA) study SRP000694. For each strain, we obtained reference sequences using the same mapping procedure described above now using BWA-SW [148] to map additional 454 reads. We called SNPs relative to the D. melanogaster genome reference when SNP quality was greater than Q40 and generated consensus sequences without ‘marking’ heterozygous sites. We estimated nucleotide polymorphism as the pairwise nucleotide variation per site (π), with X-linked values adjusted to generate comparable estimates to autosomal regions. To test the hypothesis of a heterozygosity-dependent GC∶CO repair bias imposed after DSB formation in D. melanogaster, we investigated the relationship between polymorphism and the ratio γ/c or γ across the genome using total π. To study the relationship between polymorphism and c we focused on polymorphism at noncoding sites (intergenic and introns). Analyses comparing estimates of π with either γ/c, γ or c across the genome were based on values from 100-kb adjacent windows. Data availability Estimates of recombination reported in this study are publicly available, for any genomic region or gene in D. melanogaster, from www.recombination.biology.uiowa.edu and www.recombinome.com. Supporting Information Figure S1 Double-strand repair (DSB) pathway and recombination during meiosis. The DSB repair via double Holliday junction can generate either crossover (CO) or non-crossover (gene conversion, GC) events while the Holliday junction-independent repair (synthesis-dependent strand annealing, SDSA) mechanism causes only GC events. (TIF) Click here for additional data file. Figure S2 Crossing over rates (c) along chromosome arms based on nth polynomial equations relating the physical position to c. A single polynomial has been applied per chromosome arm, with the physical position (x) in Mb and c (y) in cM/Mb. The equations best fitting adjacent 100-kb windows are the following: (TIF) Click here for additional data file. Figure S3 Relationship between sample mean and sample median for rates of crossing over (c). (A) Sample mean and sample median for c across the D. melanogaster genome estimated for adjacent 250-kb windows and shown separately for each chromosome arm. c indicates cM/Mb per female meiosis. (B) Blowup of the relationship between c based on sample mean and c based on sample median when c (sample mean) is smaller than 3 cM/Mb. (TIF) Click here for additional data file. Figure S4 Cumulative percentage of sequences with CO and GC motifs. Percentage of sequences with one or more of the 18 CO and 10 GC motifs found to be overrepresented in CO500 and GC500 sequences, respectively (see Figure 7 and Figure 8 for details). (TIF) Click here for additional data file. Figure S5 Schematic representation of the crossing methodology used in this study to generate recombinant chromosomes between D. melanogaster parental strains. (TIF) Click here for additional data file. Table S1 24 custom-designed 7-nucleotides tags. (DOC) Click here for additional data file.
                Bookmark

                Author and article information

                Contributors
                Role: Editor
                Journal
                PLoS Genet
                PLoS Genet
                plos
                plosgen
                PLoS Genetics
                Public Library of Science (San Francisco, USA )
                1553-7390
                1553-7404
                November 2014
                6 November 2014
                : 10
                : 11
                : e1004775
                Affiliations
                [1 ]Department of Biology, Stanford University, Stanford, California, United States of America
                [2 ]Department of Biology, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
                University of Texas at Austin, United States of America
                Author notes

                The authors have declared that no competing interests exist.

                Conceived and designed the experiments: AOB ELB KRO PSS DAP. Analyzed the data: AOB PSS DAP. Contributed reagents/materials/analysis tools: AOB ELB KRO PSS. Wrote the paper: AOB PSS DAP.

                Article
                PGENETICS-D-14-00537
                10.1371/journal.pgen.1004775
                4222749
                25375361
                d28341d7-dd72-4fc0-bc5e-6be457089bef
                Copyright @ 2014

                This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                History
                : 24 February 2014
                : 24 September 2014
                Page count
                Pages: 19
                Funding
                This work was supported by NIH NRSA fellowship F32GM097837 to AOB, by NSF GRF DGE-0822 to ELB, by NIH RO1GM100366 grant to PSS and DAP, by NIH RO1GM097415 grant to DAP and by NSF DEB 0921307 to PSS. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
                Categories
                Research Article
                Biology and Life Sciences
                Biochemistry
                DNA
                Ecology
                Ecological Economics
                Evolutionary Ecology
                Evolutionary Biology
                Evolutionary Genetics
                Evolutionary Processes
                Population Genetics
                Genetics
                Genetic Loci
                Phenotypes

                Genetics
                Genetics

                Comments

                Comment on this article