
      Genome-Wide Fine-Scale Recombination Rate Variation in Drosophila melanogaster

      research-article
      PLoS Genetics
      Public Library of Science


          Abstract

          Estimating fine-scale recombination maps of Drosophila from population genomic data is a challenging problem, in particular because of the high background recombination rate. In this paper, a new computational method is developed to address this challenge. Through an extensive simulation study, it is demonstrated that the method allows more accurate inference, and exhibits greater robustness to the effects of natural selection and noise, compared to a well-used previous method developed for studying fine-scale recombination rate variation in the human genome. As an application, a genome-wide analysis of genetic variation data is performed for two Drosophila melanogaster populations, one from North America (Raleigh, USA) and the other from Africa (Gikongoro, Rwanda). It is shown that fine-scale recombination rate variation is widespread throughout the D. melanogaster genome, across all chromosomes and in both populations. At the fine scale, a conservative, systematic search for evidence of recombination hotspots suggests the existence of a handful of putative hotspots each with at least a tenfold increase in intensity over the background rate. A wavelet analysis is carried out to compare the estimated recombination maps in the two populations and to quantify the extent to which recombination rates are conserved. In general, similarity is observed at very broad scales, but substantial differences are seen at fine scales. The average recombination rate of the X chromosome appears to be higher than that of the autosomes in both populations, and this pattern is much more pronounced in the African population than the North American population. The correlation between various genomic features (including recombination rates, diversity, divergence, GC content, gene content, and sequence quality) is examined using the wavelet analysis, and it is shown that the most notable difference between D. melanogaster and humans is in the correlation between recombination and diversity.

          Author Summary

          Recombination is a process by which chromosomes exchange genetic material during meiosis. It is important in evolution because it provides offspring with new combinations of genes, and so estimating the rate of recombination is of fundamental importance in various population genomic inference problems. In this paper, we develop a new statistical method to enable robust estimation of fine-scale recombination maps of Drosophila, a genus of common fruit flies, in which the background recombination rate is high and natural selection has been prevalent. We apply our method to produce fine-scale recombination maps for a North American population and an African population of D. melanogaster. For both populations, we find extensive fine-scale variation in recombination rate throughout the genome. We provide a quantitative characterization of the similarities and differences between the recombination maps of the two populations; our study reveals high correlation at broad scales and low correlation at fine scales, as has been documented among human populations. We also examine the correlation between various genomic features. Furthermore, using a conservative approach, we find a handful of putative recombination “hotspot” regions with solid statistical support for a local elevation of at least 10 times the background recombination rate.

          Most cited references (47)


          The fine-scale structure of recombination rate variation in the human genome.

          The nature and scale of recombination rate variation are largely unknown for most species. In humans, pedigree analysis has documented variation at the chromosomal level, and sperm studies have identified specific hotspots in which crossing-over events cluster. To address whether this picture is representative of the genome as a whole, we have developed and validated a method for estimating recombination rates from patterns of genetic variation. From extensive single-nucleotide polymorphism surveys in European and African populations, we find evidence for extreme local rate variation spanning four orders in magnitude, in which 50% of all recombination events take place in less than 10% of the sequence. We demonstrate that recombination hotspots are a ubiquitous feature of the human genome, occurring on average every 200 kilobases or less, but recombination occurs preferentially outside genes.

            Fine-scale recombination rate differences between sexes, populations and individuals.

            Meiotic recombinations contribute to genetic diversity by yielding new combinations of alleles. Recently, high-resolution recombination maps were inferred from high-density single-nucleotide polymorphism (SNP) data using linkage disequilibrium (LD) patterns that capture historical recombination events. The use of these maps has been demonstrated by the identification of recombination hotspots and associated motifs, and the discovery that the PRDM9 gene affects the proportion of recombinations occurring at hotspots. However, these maps provide no information about individual or sex differences. Moreover, locus-specific demographic factors like natural selection can bias LD-based estimates of recombination rate. Existing genetic maps based on family data avoid these shortcomings, but their resolution is limited by relatively few meioses and a low density of markers. Here we used genome-wide SNP data from 15,257 parent-offspring pairs to construct the first recombination maps based on directly observed recombinations with a resolution that is effective down to 10 kilobases (kb). Comparing male and female maps reveals that about 15% of hotspots in one sex are specific to that sex. Although male recombinations result in more shuffling of exons within genes, female recombinations generate more new combinations of nearby genes. We discover novel associations between recombination characteristics of individuals and variants in the PRDM9 gene and we identify new recombination hotspots. Comparisons of our maps with two LD-based maps inferred from data of HapMap populations of Utah residents with ancestry from northern and western Europe (CEU) and Yoruba in Ibadan, Nigeria (YRI) reveal population differences previously masked by noise and map differences at regions previously described as targets of natural selection.

              Population Genomics: Whole-Genome Analysis of Polymorphism and Divergence in Drosophila simulans

              Introduction

              Given the long history of Drosophila as a central model system in evolutionary genetics beginning with the origins of empirical population genetics in the 1930s, it is unsurprising that Drosophila data have inspired the development of methods to test population genetic theories using DNA variation within and between closely related species [1–4]. These methods rest on the supposition of the neutral theory of molecular evolution that polymorphism and divergence are manifestations of mutation and genetic drift of neutral variants at different time scales [5]. Under neutrality, polymorphism is a “snapshot” of variation, some of which ultimately contributes to species divergence as a result of fixation by genetic drift. Natural selection, however, may cause functionally important variants to rapidly increase or decrease in frequency, resulting in patterns of polymorphism and divergence that deviate from neutral expectations [1,2,6]. A powerful aspect of inferring evolutionary mechanism in this population genetic context is that selection on sequence variants with minuscule fitness effects, which would be difficult or impossible to study in nature or in the laboratory but are evolutionarily important, may cause detectable deviations from neutral predictions. Another notable aspect of these population genetic approaches is that they facilitate inferences about recent selection, which may be manifest as reduced polymorphism or elevated linkage disequilibrium, or about selection that has occurred in the distant past, which may be manifest as unexpectedly high levels of divergence. The application of these conceptual advances to the study of variation in closely related species has resulted in several fundamental advances in our understanding of the relative contributions of mutation, genetic drift, recombination, and natural selection to sequence variation. 
However, it is also clear that our genomic understanding of population genetics has been hobbled by fragmentary and nonrandom population genetic sampling of genomes. Thus, the full value of genome annotation has not yet been applied to the study of population genetic mechanisms. Combining whole-genome studies of genetic variation within and between closely related species (i.e., population genomics) with high-quality genome annotation offers several major advantages. For example, we have known for more than a decade that regions of the genome experiencing reduced crossing over in Drosophila tend to show reduced levels of polymorphism yet normal levels of divergence between species [7–10]. This pattern can only result from natural selection reducing levels of polymorphism at linked neutral sites, because it violates the neutral theory prediction of a strong positive correlation between polymorphism and divergence [5]. However, we have no general genomic description of the physical scale of variation in polymorphism and divergence in Drosophila and how such variation might be related to variation in mutation rates, recombination rates, gene density, natural selection, or other factors. Similarly, although several Drosophila genes have been targets of molecular population genetic analysis, in many cases, these genes were not randomly chosen but were targeted because of their putative association with phenotypes thought to have a history of adaptive evolution [11,12]. Such biased data make it difficult to estimate the proportion of proteins diverging under adaptive evolution. In a similar vein, the unique power of molecular population genetic analysis, when used in concert with genome annotation, could fundamentally alter our notions about phenotypic divergence due to natural selection. This is because our current understanding of phenotypic divergence and its causes is based on a small and necessarily highly biased description of phenotypic variation. 
Alternatively, a comprehensive genomic investigation of adaptive divergence could use genome annotations to reveal large numbers of new biological processes previously unsuspected of having diverged under selection. Here we present a population genomic analysis of D. simulans. D. simulans and D. melanogaster are closely related and split from the outgroup species, D. yakuba, several million years ago [13–15]. The vast majority of D. simulans and D. yakuba euchromatic DNA is readily aligned to D. melanogaster, which permits direct use of D. melanogaster annotation for investigation of polymorphism and divergence and allows reliable inference of D. simulans–D. melanogaster ancestral states over much of the genome. Our analysis uses a draft version of a D. yakuba genome assembly (aligned to the D. melanogaster reference sequence) and a set of light-coverage, whole-genome shotgun data from multiple inbred lines of D. simulans, which were syntenically aligned to the D. melanogaster reference sequence. Results/Discussion Genomes and Assemblies Seven lines of D. simulans and one line of D. yakuba were sequenced at the Washington University Genome Sequencing Center (the white paper can be found at http://www.genome.gov/11008080). The D. simulans lines were selected to capture variation in populations from putatively ancestral geographic regions [16], recent cosmopolitan populations, and strains encompassing the three highly diverged mitochondrial haplotypes previously described for the species [17]. These strains have been deposited at the Tucson Drosophila Stock Center (http://stockcenter.arl.arizona.edu). A total of 2,424,141 D. simulans traces and 2,245,197 D. yakuba traces from this project have been deposited in the National Center for Biotechnology Information (NCBI) trace archive. D. simulans syntenic assemblies were created by aligning trimmed, uniquely mapped sequence traces from each D. simulans strain to the euchromatic D. melanogaster reference sequence (v4). 
Two strains from the same population, sim4 and sim6, were unintentionally mixed prior to library construction; reads from these strains were combined to generate a single, deeper, syntenic assembly (see Materials and Methods), which is referred to as SIM4/6. The other strains investigated are referred to as C167.4, MD106TS, MD199S, NC48S, and w501 . Thus, six (rather than seven) D. simulans syntenic assemblies are the objects of analysis. Details on the fly strains and procedures used to create these assemblies, including the use of sequence quality scores, can be found in Materials and Methods. The coverages (in Mbp) for C167.4, MD106TS, MD199S, NC48S, SIM4/6, and w501 , are 56.9, 56.3, 63.4, 42.6, 89.8, and 84.8, respectively. A D. yakuba strain Tai18E2 whole-genome shotgun assembly (v2.0; http://genome.wustl.edu/) generated by the Parallel Contig Assembly Program (PCAP) [18] was aligned to the D. melanogaster reference sequence (Materials and Methods). The main use of the D. yakuba assembly was to infer states of the D. simulans–D. melanogaster ancestor. For many analyses, we used divergence estimates for the D. simulans lineage or the D. melanogaster lineage (from the inferred D. simulans–D. melanogaster ancestor) rather than the pairwise (i.e., unpolarized) divergence between these species. These lineage-specific estimates are often referred to as “D. simulans divergence,” “D. melanogaster divergence,” or “polarized divergence.” A total of 393,951,345 D. simulans base pairs and 102,574,197 D. yakuba base pairs were syntenically aligned to the D. melanogaster reference sequence. Several tens of kilobases of repeat-rich sequences near the telomeres and centromeres of each chromosome arm were excluded from our analyses (Materials and Methods). D. simulans genes were conservatively filtered for analysis based on conserved physical organization and reading frame with respect to the D. melanogaster reference sequence gene models (Materials and Methods). 
We took this conservative approach so as to retain only the highest quality D. simulans data for most inferences. The number of D. simulans genes remaining after filtering was 11,466. Ninety-eight percent of coding sequence (CDS) nucleotides from this gene set are covered by at least one D. simulans allele. The average number of lines sequenced per aligned D. simulans base was 3.90. For several analyses in which heterozygosity and divergence per site were estimated, we further filtered the data so as to retain only genes or functional elements (e.g., untranslated regions [UTRs]) for which the total number of bases sequenced across all lines exceeded an arbitrary threshold (see Materials and Methods). The numbers of genes for which we estimated coding region expected heterozygosity, unpolarized divergence, and polarized divergence were 11,403, 11,439, and 10,150, respectively. Coverage on the X chromosome was slightly lower than autosomal coverage, which is consistent with less X chromosome DNA than autosomal DNA in mixed-sex DNA preps. Variable coverage required analysis of individual coverage classes (n = 1–6) for a given region or feature, followed by estimation and inference weighted by coverage (Materials and Methods). The D. simulans syntenic alignments are available at http://www.dpgp.org/. An alternative D. simulans “mosaic” assembly, which is available at http://www.genome.wustl.edu/, was created independently of the D. melanogaster reference sequence. General Patterns of Polymorphism and Divergence Nucleotide variation. We observed 2,965,987 polymorphic nucleotides, of which 43,878 altered the amino acid sequence; 77% of sampled D. simulans genes were segregating at least one amino acid polymorphism. The average, expected nucleotide heterozygosity (hereafter, “heterozygosity” or “πnt”) for the X chromosome and autosomes was 0.0135 and 0.0180, respectively. 
X chromosome πnt was not significantly different from that of the autosomes (after multiplying X chromosome πnt by 4/3, to correct for X/autosome effective population size differences when there are equal numbers of males and females; see [19]). However, X chromosome divergence was greater than autosomal divergence in all three lineages (50-kb windows; Table 1, Table S1, Figure 1, Dataset S8). We will discuss this pattern in greater detail below.

Table 1. Autosome and X Chromosome Weighted Averages of Nucleotide Heterozygosity (π) and Lineage Divergence

Figure 1. Patterns of Polymorphism and Divergence of Nucleotides along Chromosome Arms. Nucleotide π (blue) and div on the D. simulans lineage (red) in 150-kbp windows are plotted every 10 kbp. χ[–log(p)] (olive) as a measure of deviation (+ or –) in the proportion of polymorphic sites in 30-kbp windows is plotted every 10 kbp (see Materials and Methods). C and T correspond to locations of centromeres and telomeres, respectively. Chromosome arm 3R coordinates correspond to D. simulans locations after accounting for fixed inversion on the D. melanogaster lineage.

Not surprisingly, many patterns of molecular evolution identified from previously published datasets were confirmed in this genomic analysis. For example, synonymous sites and nonsynonymous sites were the fastest and slowest evolving site types, respectively [20–24]. Nonsynonymous divergence (dN) and synonymous divergence (dS) were positively, though weakly, correlated (r² = 0.052, p 6 for each of polymorphisms, fixations, synonymous variants, and nonsynonymous variants (Dataset S1). The filtered data set of unpolarized MK tests contained 6,702 genes, of which 1,270 (19%) were significant (in the direction of adaptive evolution) at the 0.05 critical value and 539 (8%) genes were significant at a 0.01 critical value. 
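The per-gene MK tests summarized above can be sketched as follows. This is a minimal illustration, assuming the setup described in the Materials and Methods of this reference: a 2 × 2 table of variant type (nonsynonymous versus synonymous) by class (fixed versus polymorphic), a requirement that every row and column sum to six or greater, and Fisher's Exact test for significance. The function names, the two-sided form of the test, the ratio comparison used to decide the direction of the deviation, and the counts in the usage lines are all assumptions made for the example, not values or code from the study.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    tail = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, tail)


def mk_test(dn, ds, pn, ps, min_margin=6):
    """Hypothetical helper: MK test on fixed (dn, ds) and polymorphic (pn, ps) counts.

    Returns None when any row or column sum falls below min_margin,
    otherwise (p_value, excess_nonsynonymous_fixations).
    """
    margins = (dn + pn, ds + ps, dn + ds, pn + ps)
    if min(margins) < min_margin:
        return None
    p_value = fisher_exact_two_sided(dn, pn, ds, ps)
    # Assumed direction rule: dn/ds > pn/ps, written without division.
    adaptive_direction = dn * ps > pn * ds
    return p_value, adaptive_direction


# Illustrative counts only (not taken from the D. simulans data).
result = mk_test(dn=7, ds=17, pn=2, ps=42)
too_sparse = mk_test(dn=1, ds=3, pn=1, ps=2)
```

A gene would count toward the "significant in the direction of adaptive evolution" tally only if `mk_test` returns a p-value below the chosen critical value and the direction flag is true; genes returning `None` are the ones removed by the minimum-count filter.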
Given that MK tests can only detect directional selection when multiple beneficial mutations have fixed, these results provide a conservative view of the prevalence of adaptive protein divergence. There was a slight enrichment of significant unpolarized MK tests on the autosomes relative to the X chromosome (Fisher's Exact test, p = 0.0014). However, conclusions regarding the incidence of directional selection on autosomes versus the X chromosome should be tempered by the fact that the average numbers of polymorphic and fixed variants per gene may differ between the two types of chromosomes, which affects the power of the MK test to reject neutrality. We observed no enrichment of significant tests in regions of the X chromosome hypothesized to experience greater versus lower rates of crossing over. Several of the most highly significant MK test statistics are from genes with known functions and in many cases, known names and mutant phenotypes. More generally, among the genes with no associated GO term, a smaller proportion had significant unpolarized MK tests compared to the proportion for genes associated with one or more GO terms (0.16 versus 0.20, p = 3 × 10−5). Included among the most highly significant genes in the unpolarized MK tests (Table S12) were several with reproduction-related functions. For example, the sperm of males carrying mutations in Pkd2 (CG6504), the gene with the smallest MK p-value in the genome, are not properly stored in females, suggesting sperm–female interactions (perhaps associated with sperm competition) as a possible agent of selection [92,93]. Other examples include Nc (CG8091), which plays a role in sperm individualization [94]; Acxc (CG5983), a sperm-specific adenylate cyclase [95]; and Dhc16F (CG7092), which is a component of the axonemal dynein complex (suggesting a possible role of selection on sperm motility). For polarized MK tests, we used the D. yakuba genome to infer which fixed differences between D. simulans and D. 
melanogaster occurred along the D. simulans lineage (Materials and Methods). These fixations were then compared to D. simulans polymorphisms. This reduced, filtered dataset contained 2,676 genes of which 384 (14%) and 169 (6%) were significant at the 0.05 and 0.01 levels, respectively (deviating in the direction of adaptive evolution; Datasets S1). Twenty-three genes showed evidence of a significant (p 10 genes include nuclear envelope, nuclear pore, amino acid-polyamine transporter activity, ubiquitin-specific protease activity, protein deubiquitination, and protein import into the nucleus. Results from a comparable analysis of D. melanogaster protein evolution are shown in Table S21. Using the same criteria of n > 10 genes and p 2) indicative of directional selection on 5′ UTRs, 3′ UTRs, and intron sequences, respectively. Among the most unusual 5′UTRs are those associated with genes coding for proteins associated with the cytoskeleton or the chromosome, categories that also appeared as unusual in the MK tests on protein variation. Two of the top-ten 3′ UTRs are associated with the SAGA complex, a multi-subunit transcription factor involved in recruitment of RNA Pol II to the chromosome [111]. Among the extreme introns, two are from genes coding for components of the ABC transporter complex and two are from genes coding for centrosomal proteins, again pointing to the unusual evolution of genes associated with the cytoskeleton and chromosome structure and movement. As previously noted, a large number of significant UTRs deviate in the direction of excess polymorphism (relative to synonymous mutations). Given the potential importance of the UTRs in regulating transcript abundance and localization, translational control, and as targets of regulatory microRNAs [112], such UTRs could be attractive candidates for functional investigation. 
Contingency tests of significant versus nonsignificant MK test for amino acids versus each of the noncoding elements yielded p-values of 0.65, 0.04, and 0.07 for 5′ UTRs, 3′ UTRs, and introns, respectively. Thus, there is weak evidence that genes under directional selection on amino acid sequences tend to have 3′ UTRs and introns influenced by directional selection as well. Whole-Genome Analysis of Polymorphic and Fixed Variants Up to this point, our analyses have investigated various attributes of polymorphism and divergence based on windows or genes. An alternative approach for understanding the causes of variation and divergence is to analyze polymorphism and divergence across site types. Table 2 shows whole-genome counts of polymorphic and polarized fixed variants for UTRs, synonymous sites, nonsynonymous sites, introns, and intergenic sites. We also provide data for polarized, synonymous preferred or unpreferred variants. Almost all preferred versus unpreferred codons in D. melanogaster end in GC versus AT, respectively [113]; thus, preferred versus unpreferred codons can be thought of as GC-ending versus AT-ending codons. Table 2 Whole-Genome Counts of Polarized Polymorphic and Fixed Variants Nonsynonymous sites showed the smallest ratio of polymorphic-to-fixed variants, which is consistent with the MK tests and supports the idea that such sites are the most likely to be under directional selection. Nonsynonymous polymorphisms also occur at slightly lower frequency than do noncoding variants (Table S25). Synonymous sites have the highest ratio of polymorphic-to-fixed variants, which supports the previously documented elevated ratio of polymorphic-to-fixed unpreferred synonymous variants in D. simulans [89]. The confidence intervals of the ratio of polymorphic-to-fixed variants among site types are nonoverlapping with the exception of intron and intergenic sites. 
If preferred synonymous mutations are, on average, beneficial [89,114], then the smaller polymorphic-to-fixed ratio for nonsynonymous and UTR variants versus preferred variants implies that a large proportion of new nonsynonymous and UTR mutations are beneficial. Using similar reasoning, the data in Table 2 suggest that directional selection plays a larger role in nonsynonymous and UTR divergence compared to intron and intergenic divergence [20,115,116]. These conclusions are consistent with estimates of α [11,117], the proportion of sites fixing under directional selection (assuming that synonymous sites are neutral and at equilibrium) for different site types. Base Composition Evolution Determining the relative contributions of various mutational and population genetic processes to base composition variation and inferring the biological basis of selection on base composition remain difficult problems. Much of the previously published data on base composition variation in D. simulans have been from synonymous sites [55,89,90,118]. Several lines of evidence [55,89,90,113,118] suggest that on average, preferred codons have higher fitness than unpreferred codons, with variation in codon usage being maintained by AT-biased mutation, weak selection against unpreferred codons, and genetic drift [23,114]. However, the possibilities of nonequilibrium mutational processes and/or natural selection favoring different base composition in different lineages have also been addressed [119]. The D. simulans population genomic data allow for a thorough investigation of the population genetics and evolution of base composition for both coding and noncoding DNA [59,120]. The analyses discussed below use parsimony to polarize polymorphic and fixed variants. Complete genomic and gene-based data are available as Datasets S9 and S10. Synonymous sites. Previous reports suggested that D. 
simulans synonymous sites are evolving towards higher AT content, although the excess of AT over GC fixations is small [55]. That trend was confirmed in this larger dataset; there are many more ancestral preferred codons that have fixed an unpreferred codon (coverage classes four–six, n = 21,156) in D. simulans compared with ancestral unpreferred codons that have fixed a preferred codon (coverage classes four–six, n = 15,409). Furthermore, the population genomic data also support previous reports [89] that D. melanogaster synonymous sites are becoming AT-rich at a faster rate than D. simulans synonymous sites (Table S26), contributing to the higher median dS in D. melanogaster (0.069) compared to D. simulans (0.051, Wilcoxon Signed Rank, p 200 bp. Expected heterozygosity was also estimated for genomic features (exons, introns, UTRs, and intergenic sequence) that had a minimum size and coverage [i.e., n(n – 1) × s ≥ 100, where n = average number of alleles sampled and s = number of sites]. For coding regions, the numbers of silent and replacement sites were counted using the method of Nei and Gojobori [129]. The pathway between two codons was calculated as the average number of silent and replacement changes from all possible paths between the pair. The variance of pairwise differences in sliding windows (150-kb windows, 10-kb increments) was used as a method of summarizing the magnitude of linkage disequilibrium across the D. simulans genome. For each window, we calculated the coverage-weighted variance of the expected heterozygosity (see above) for all pairs of alleles.

Divergence. Unpolarized (i.e., pairwise) divergence between D. simulans and D. melanogaster was estimated for 10-kb windows, 50-kb windows, 30-kb sliding windows (10-kb increments), 150-kb sliding windows (10-kb increments), 210-kb windows (10-kb increments), and genomic features that had a minimum number of nucleotides represented [i.e., n × s > 100, with n and s as above in calculations of π]. 
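The size-and-coverage filter for heterozygosity, n(n – 1) × s ≥ 100, can be made concrete with a short sketch. It assumes that expected heterozygosity at a site is the fraction of allele pairs that differ, and it averages over sites without the coverage-class weighting that this reference applies, so it is a simplified stand-in and not the study's estimator. The function names and the input layout (one string per aligned site, one character per sequenced line, `N` for a missing base) are hypothetical.

```python
from itertools import combinations


def site_heterozygosity(column):
    """Fraction of allele pairs that differ at one site, or None if fewer than 2 calls."""
    called = [base for base in column if base in "ACGT"]
    if len(called) < 2:
        return None
    pairs = list(combinations(called, 2))
    return sum(x != y for x, y in pairs) / len(pairs)


def feature_pi(columns, threshold=100):
    """Hypothetical helper: mean per-site heterozygosity for one genomic feature.

    Returns None when the feature fails n(n - 1) * s >= threshold, where
    n is the average number of alleles sampled and s the number of sites.
    """
    values, depths = [], []
    for column in columns:
        h = site_heterozygosity(column)
        if h is not None:
            values.append(h)
            depths.append(sum(base in "ACGT" for base in column))
    s = len(values)
    if s == 0:
        return None
    n = sum(depths) / s
    if n * (n - 1) * s < threshold:
        return None
    # Simplification: unweighted mean over sites (the source weights by coverage).
    return sum(values) / s


# Ten sites sampled in four lines: 4 * 3 * 10 = 120, so the feature passes.
pi_kept = feature_pi(["AAAA"] * 8 + ["AATT"] * 2)
# Five sites sampled in four lines: 4 * 3 * 5 = 60, so the feature is dropped.
pi_dropped = feature_pi(["AAAA"] * 5)
```

The filter scales with both sampling depth and length, so a short feature can still pass when many lines cover it, while a long feature covered by only two lines needs at least 50 sites.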
Unpolarized divergence was calculated as the average pairwise divergence at each site, which was then summed over sites and divided by the total number of sites. A Jukes-Cantor [130] correction was applied to account for multiple hits. For coding regions, the numbers of silent and replacement sites were counted using the method of Nei and Gojobori [129]. The pathway between two codons was calculated as the average number of silent and replacement changes from all possible paths between the pair. Estimates of unpolarized divergence over chromosome arms were calculated for each feature with averages weighted by the number of sites per feature. Lineage-specific divergence was estimated by maximum likelihood using PAML v3.14 [131] and was reported as a weighted average over each line with greater than 50 aligned sites in the segment being analyzed. Maximum likelihood estimates of divergence were calculated over 10-kb windows, 50-kb windows, 30-kb sliding windows (10-kb increments), 150-kb sliding windows (10-kb increments), 210-kb windows (10-kb increments), and gene features (exons, introns, and UTRs). PAML was run in batch mode using a BioPerl wrapper [132]. For noncoding regions and windows, we used baseml with HKY as the model of evolution to account for transition/transversion bias and unequal base frequencies [133]; for coding regions, we used codeml with codon frequencies estimated from the data. Insertion and deletion divergence was calculated as div_i, the coverage-weighted average divergence of deletions (i = ▵) or insertions (i = ▿) per base pair, where n_c is the number of aligned base pairs in the genomic region (e.g., gene feature or window) with sequencing coverage c, and k_cj is the number of sites in this region with coverage c at which the derived state with respect to the D. melanogaster reference sequence (▵ or ▿) occurs in j out of the c sequences.

MK tests (unpolarized and polarized). Unpolarized MK tests [4] used D. 
simulans polymorphism data and the D. melanogaster reference sequence for counting fixed differences. Polarized MK tests used D. yakuba to infer the D. simulans/D. melanogaster ancestral state. For both polarized and unpolarized analyses, we took the conservative approach of retaining for analysis only codons for which there were no more than two alternative states. For cases in which two alternative codons differed at more than one position, we used the pathway between codons that minimized the number of nonsynonymous substitutions. This is conservative with respect to the alternative hypothesis of adaptive evolution. Polymorphic codons at which one of the D. simulans codons was not identical to the D. melanogaster codon were not included. To reduce multiple testing problems, we filtered the data to retain for further analysis only genes that exceeded a minimum number of observations; we required that each row and column in the 2 × 2 table (two variant types and polymorphic versus fixed) sum to six or greater. Statistical significance of 2 × 2 contingency tables was determined by Fisher's Exact test. MK tests were also carried out for introns and Gold Collection UTRs by comparing synonymous variants in the associated genes with variants in these functional elements. For intergenic MK tests, we used synonymous variants from genes within 5 kb of the 5′ and/or 3′boundary of the intergenic segment. For some analyses, we restricted our attention to MK tests that rejected the null in the direction of adaptive evolution. This categorization was determined following Rand and Kann [134]. Polarized 2 × 2 contingency tables were used to calculate α, which under some circumstances can be thought of as an estimate of the proportion of variants fixing under selection [11]. Bootstrap confidence intervals of α and of the ratio of polymorphic-to-fixed variants for each functional element (Table 2) were estimated in R using bias correction and acceleration [135]. Rate variation. 
Our approach takes overall rate variation among lineages into account when generating expected numbers of substitutions under the null model and allows for different rates of evolution among chromosome arms (e.g., a faster-X effect). For example, the number of substitutions for all X-linked 50-kb windows was estimated using PAML (baseml), allowing different rates for each lineage. All D. simulans lines were used, with the estimated D. simulans substitution rate for each window being the coverage-weighted average. This generated an empirically determined branch length of the X chromosome for the average over each of the D. simulans lines (from all three-way comparisons with D. melanogaster and D. yakuba) weighted by the number of bases covered. We carried out a relative rate test for windows or features in D. simulans and D. melanogaster by generating the expected number of substitutions for each window/feature/lineage based on the branch length of the entire chromosome in each lineage (PAML) and the coverage of the window/feature in question in each lineage. We then calculated the deviation from the expected number of substitutions as (observed – expected substitutions)^2/expected substitutions for any window/feature/lineage.

GO by MK permutations. For each GO term associated with at least five MK tests, we calculated the proportion of significant (p [...]

Genes Located in Genomics Regions Showing Disproportionate Reductions of Nucleotide Heterozygosity in the US Sample (68 KB DOC)
Table S11 GO Terms Overrepresented in Windows from Out-of-Africa/Madagascar Analysis. MF and BP, molecular function and biological process, respectively (50 KB DOC)
Table S12 GO Terms Associated with the Top 20 Genes with the Smallest Unpolarized MK Test p-Value (118 KB DOC)
Table S13 Genes Showing Excess Protein Polymorphism (p 2) (55 KB DOC)
Table S23 Genes Associated with the Most-Significant 3′ UTR Polarized MK Tests (Average Coverage per Site > 2) (52 KB DOC)
Table S24 Genes Associated with the Most-Significant Intron MK Tests (Average Coverage per Site > 2) (64 KB DOC)
Table S25 Number (Frequency) of Nonsynonymous and Noncoding Polymorphisms (Sites with Coverage of n = 5 or n = 6 D. simulans Alleles) for Different Frequency Classes (40 KB DOC)
Table S26 Counts and Substitution Rates per Site of Preferred and Unpreferred Variants “Fixed” along the D. simulans and D. melanogaster Lineages (Inferred by Parsimony). Substitution rates were determined by dividing the number of preferred/unpreferred fixations by the number of unpreferred/preferred ancestral bases. (74 KB DOC)
Table S27 X and A, Polymorphic and Fixed, Preferred and Unpreferred Variants for Sites with Coverages Four, Five, or Six (33 KB DOC)
Table S28 Unpreferred Polymorphisms (Coverage Five Sites) Occur at Lower Frequency than Preferred Polymorphisms (30 KB DOC)
Table S29 Genes with Significant Polarized MK Tests Have a Higher Proportion of Preferred Fixations than Genes with Nonsignificant MK Tests (27 KB DOC)
Table S30 Preferred, Unpreferred, and Noncoding GC/AT Fixed Variants across the Genome (Coverage Classes Three–Six) (27 KB DOC)
Table S31 Polymorphic GC Variants Occur at Higher Frequency than Polymorphic AT Variants. X-linked polymorphic GC variants occur at higher frequency than autosomal polymorphic GC variants (coverage-six polymorphisms from intergenic and intron DNA). (32 KB DOC)
Table S32 D. yakuba Genome Input and Assembly Statistics. Statistics presented are for the whole-genome assembly before it was anchored using alignments to D. melanogaster. “Contigs” are contiguous sequences not interrupted by gaps, and “supercontigs” are ordered and oriented “contigs” including estimated gap sizes. The N50 statistic is defined as the largest length L such that 50% of all nucleotides are contained in contigs of size at least L. The total contig size was 167 Mb, with 97% of the consensus base pairs having quality scores of at least 40 (Q40) (expected error rate of less than or equal to 10^-4) and 98% are at least Q20. (59 KB DOC)
Table S33 Read and Trim Statistics for D. simulans Syntenic Assemblies (35 KB DOC)
Table S34 Correlation (Kendall's τ) between Copy Numbers of TE Families in “Trimmed” Euchromatic Regions of D. simulans and D. melanogaster. The simulans TEs are the “clustered” TEs. The melanogaster TEs are those annotated in release 4.0. (31 KB DOC)
Table S35 Tests of the Homogeneity of the Proportions of Each Family across Six D. simulans Lines, Homogeneity of Classes across Lines, and Homogeneity of Families within Classes across Lines (33 KB DOC)
Table S36 Test of the Homogeneity of Relative Family Copy Numbers across the Five Chromosome Arms (Pooled across Lines) for All TEs and within the Four Classes (33 KB DOC)
Table S37 Test of the Homogeneity of Relative Family Copy Numbers on the X chromosome versus the Autosomes (Pooled across Lines) for All TEs and within the Four Classes (32 KB DOC)
Table S38 Heterogeneity of “Cloned” TE Numbers in Various Gene Annotation Elements (29 KB DOC)
Table S39 Comparison of Expected D. simulans Nucleotide Heterozygosity and Divergence for 30-kb Windows Centered on the Estimated Position of “Clustered” TEs (+) Compared to Windows without Clustered TEs (–). The difference between the distributions (TEs: +/-) was tested with the Mann-Whitney U test; the p-value is in the upper position in the last column (probability < / ratio). The ratio of the means is also shown (lower in last column). (50 KB DOC)
Text S1 Transposable Elements (48 KB DOC)

Accession Numbers
The GenBank (http://www.ncbi.nlm.nih.gov/Genbank/) accession number for D. yakuba is AAEU01000000 (version 1) and for the D. simulans w501 whole-genome shotgun assembly is TBS-AAEU01000000 (version 1).
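
Three of the quantities named in the methods text above (the Jukes-Cantor correction, α from a polarized MK table, and the relative-rate deviation) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the function and argument names are hypothetical, the Jukes-Cantor formula is the standard one for nucleotide sequences, and α is assumed to be the commonly used estimator 1 − (Ds × Pn)/(Dn × Ps), which the text cites but does not spell out.

```python
import math


def jukes_cantor(p: float) -> float:
    """Correct an observed per-site proportion of differences for multiple hits.

    Standard Jukes-Cantor distance: d = -(3/4) * ln(1 - 4p/3).
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def mk_alpha(dn: int, ds: int, pn: int, ps: int) -> float:
    """Estimate alpha from a 2 x 2 MK table (assumed estimator, see lead-in).

    dn, ds: fixed nonsynonymous and synonymous differences.
    pn, ps: polymorphic nonsynonymous and synonymous variants.
    """
    if dn == 0 or ps == 0:
        raise ValueError("alpha is undefined when dn or ps is zero")
    return 1.0 - (ds * pn) / (dn * ps)


def rate_deviation(observed: float, expected: float) -> float:
    """Deviation used in the relative rate test: (observed - expected)^2 / expected."""
    if expected <= 0:
        raise ValueError("expected number of substitutions must be positive")
    return (observed - expected) ** 2 / expected


if __name__ == "__main__":
    # Illustrative inputs only; these are not values from the study.
    print(jukes_cantor(0.10))
    print(mk_alpha(dn=20, ds=10, pn=5, ps=10))
    print(rate_deviation(observed=12, expected=8))
```

The corrected distance always exceeds the raw proportion of differences for p > 0, which is the point of the multiple-hits correction. The deviation is computed per window, feature, and lineage, so in practice it would be applied across a table of observed and expected counts.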

                Author and article information

                Contributors
                Role: Editor
                Journal
                PLoS Genet
                PLoS Genetics
                Public Library of Science (San Francisco, USA )
                1553-7390
                1553-7404
                December 2012
                20 December 2012
                Volume: 8
                Issue: 12
                Article: e1003090
                Affiliations
                [1 ]Computer Science Division, University of California Berkeley, Berkeley, California, United States of America
                [2 ]Department of Statistics, University of California Berkeley, Berkeley, California, United States of America
                University of Oxford, United Kingdom
                Author notes

                The authors have declared that no competing interests exist.

                Conceived and designed the experiments: AHC PAJ YSS. Performed the experiments: AHC PAJ YSS. Analyzed the data: AHC PAJ YSS. Contributed reagents/materials/analysis tools: AHC PAJ YSS. Wrote the paper: AHC PAJ YSS.

                Article
                Manuscript: PGENETICS-D-12-01101
                DOI: 10.1371/journal.pgen.1003090
                PMC ID: 3527307
                PMID: 23284288
                83c3a98a-0556-46f3-8843-c4054d67cc02
                Copyright © 2012

                This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                History
                Received: 4 May 2012
                Accepted: 29 September 2012
                Page count
                Pages: 28
                Funding
                This research is supported in part by a National Institutes of Health grant R01-GM094402, an Alfred P. Sloan Research Fellowship, and a Packard Fellowship for Science and Engineering to YSS, and by a National Defense Science and Engineering Graduate (NDSEG) Fellowship to AHC. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
                Categories
                Research Article
                Biology
                Evolutionary Biology
                Population Genetics
                Mathematics
                Probability Theory
                Stochastic Processes
                Statistics
                Statistical Methods

                Genetics
