8
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      A unique cluster of roo insertions in the promoter region of a stress response gene in Drosophila melanogaster

      research-article

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          Transposable elements (TEs) are not randomly distributed in the genome. A genome-wide analysis of the D. melanogaster genome found that differences in TE density across 50 kb genomic regions was due both to transposition and duplication. At smaller genomic scales, promoter regions of hsp genes and the promoter region of CG18446 have been shown to accumulate TE insertions. In this work, we have further analyzed the promoter region of CG18446. We screened 218 strains collected in 15 natural populations, and we found that the CG18446 promoter region contains 20 independent roo insertions. Based on phylogenetic analysis, we suggest that the presence of multiple roo insertions in this region is likely to be the result of several bursts of transposition. Moreover, we found that the roo insertional cluster in the CG18446 promoter region is unique: no other promoter region in the genome contains a similar number of roo insertions. We found that, similar to hsp gene promoters, chromatin accessibility could be one of the factors explaining the recurrent insertions of roo elements in CG18446 promoter region.

          Electronic supplementary material

          The online version of this article (10.1186/s13100-019-0152-9) contains supplementary material, which is available to authorized users.

          Related collections

          Most cited references35

          • Record: found
          • Abstract: found
          • Article: found
          Is Open Access

          Natural variation in genome architecture among 205 Drosophila melanogaster Genetic Reference Panel lines

          The Drosophila melanogaster Genetic Reference Panel (DGRP) is a community resource of 205 sequenced inbred lines, derived to improve our understanding of the effects of naturally occurring genetic variation on molecular and organismal phenotypes. We used an integrated genotyping strategy to identify 4,853,802 single nucleotide polymorphisms (SNPs) and 1,296,080 non-SNP variants. Our molecular population genomic analyses show higher deletion than insertion mutation rates and stronger purifying selection on deletions. Weaker selection on insertions than deletions is consistent with our observed distribution of genome size determined by flow cytometry, which is skewed toward larger genomes. Insertion/deletion and single nucleotide polymorphisms are positively correlated with each other and with local recombination, suggesting that their nonrandom distributions are due to hitchhiking and background selection. Our cytogenetic analysis identified 16 polymorphic inversions in the DGRP. Common inverted and standard karyotypes are genetically divergent and account for most of the variation in relatedness among the DGRP lines. Intriguingly, variation in genome size and many quantitative traits are significantly associated with inversions. Approximately 50% of the DGRP lines are infected with Wolbachia , and four lines have germline insertions of Wolbachia sequences, but effects of Wolbachia infection on quantitative traits are rarely significant. The DGRP complements ongoing efforts to functionally annotate the Drosophila genome. Indeed, 15% of all D. melanogaster genes segregate for potentially damaged proteins in the DGRP, and genome-wide analyses of quantitative traits identify novel candidate genes. The DGRP lines, sequence data, genotypes, quality scores, phenotypes, and analysis and visualization tools are publicly available.
            Bookmark
            • Record: found
            • Abstract: found
            • Article: found
            Is Open Access

            The Many Landscapes of Recombination in Drosophila melanogaster

            Introduction Recombination is a fundamental biological process. Mechanistically, most sexual eukaryotes require recombination between homologous chromosomes for proper formation of haploid gametes from diploid germ cells [1]. Evolutionarily, recombination is predicted to increase the effectiveness of selection in natural populations, thus explaining the pervasiveness of recombination and sex [2]–[9]. In most species, recombination rates also vary among and along chromosomes. This intra-genomic variation provides a unique opportunity to evaluate the evolutionary consequences of recombination due to exposure to identical demographic and environmental factors; an all else being equal premise that is hardly ever warranted when comparing populations or species. Population genetic analyses across genomes have confirmed the profound effects that recombination imposes on the evolutionary process, shaping levels of genetic variation, limiting the accumulation of deleterious mutations and enhancing rates of adaptation in many species including Drosophila, Arabidopsis thaliana, Saccharomyces cerevisiae and nematodes [10]–[25]. Yet these studies have been hindered, at least in part, by three limitations associated with the lack of detailed characterization of natural variation in recombination across genomes and within species. First, population genomic analyses can now analyze patterns of selection at the scale of single genes, or even focus on specific gene regions, but most whole-genome genetic maps localize recombination events with much less detail [24], [26]–[31]. Second, most species' maps based on direct measures of recombination are obtained either from two specific, usually highly diverged, strains (e.g., Caenorhabditis elegans [30] or S. cerevisiae [32]), or compiled from crosses in different laboratory/natural conditions that can also influence recombination rates [13], [26], [28]. These genetic maps are customarily assumed to represent a monomorphic description of recombination for a given species, even when there is ample evidence of intra-specific natural variation. In species like A. thaliana, D. melanogaster, C. elegans, maize, mice or humans, variation has been reported for the number of recombination events at the level of whole chromosomes [28], [30], [33]–[35] or for specific chromosomal intervals [36]–[44], but a high-resolution whole-genome study of variation in genetic maps based on multiple natural strains is lacking. Third, the process of meiotic recombination associated with the repair of double strand breaks (DSB) has two possible outcomes with diverse evolutionary consequences: crossing over (CO) and non-crossing over (or gene conversion; GC). Although CO events also include a gene conversion tract, we will use GC to refer to recombination events that are not accompanied by crossing over (Figure S1). Unlike CO, GC results in the exchange of only small tracts of a chromosome, interrupting linkage disequilibrium in a much localized manner while having no effect on linkage disequilibrium at longer intervals, thus GC plays a stronger role than CO at short physical distances [45]–[48]. Corresponding high-resolution CO and GC maps based on direct experimental detection of recombination events however are only available for the unicellular yeast Saccharomyces cerevisiae [32]. In multicellular organisms, CO maps are used as proxy for total recombination, and the consequences of GC are overlooked, despite its potential influence on total recombination, particularly in regions or chromosomes with limited CO. To alleviate all these deficiencies, we generated high-resolution, whole-genome CO and GC maps from eight crosses between natural strains of D. melanogaster (see Materials and Methods for details). In order to obtain haploid genomes resulting from female meioses, we crossed heterozygous D. melanogaster females to males of D. simulans and genotyped the female hybrid progeny using whole-genome high-throughput Illumina sequencing. Reads corresponding to D. simulans were removed bioinformatically by mapping them to D.simulans genomic sequences. We then generated a whole-genome D. melanogaster haplotype per genotyped fly by mapping high-quality informative SNPs to one of the two parental D. melanogaster strains. Recombination events were detected, and directly assigned as CO or GC events, based on changes in parental identity along each D. melanogaster haplotype. Overall, we characterized the products of 5,860 female meioses and genotyped an average of 49,000 informative SNPs per fly, for a total of 139 million SNPs. We mapped more than 106,000 recombination events (CO and GC combined) with a median distance to the nearest informative SNP of less than 2.0 kb (1.83 kb). This resolution is almost equivalent to the high-resolution mapping of meiotic recombination in the unicellular S. cerevisiae [32], 15-fold higher than the linkage map in A. thaliana also based on recombinant inbred lines [27], and more than 50-fold more detailed than current high-resolution whole-genome CO maps in humans [28], C. elegans [30], C. briggsae [24], or D. pseudoobscura [29]. Results A high-resolution CO map for D. melanogaster Combining the results from all crosses we detected a total of 32,511 CO events that were used to generate high-resolution CO maps in D. melanogaster (Figure 1). Due to the elevated density of markers and the small number of CO events per chromosome and genotyped fly, each CO is supported by many contiguous markers at either side and it is our expectation that we have detected all COs. The total genetic map length for D. melanogaster obtained in our crosses is 287.3 cM, closely matching classical measures (282 cM [26]). A low-resolution approximation to the distribution of CO rates (c) along chromosome arms based on our data (Figure S2) recovers the same general, large-scale distribution as previous maps based on visible markers [11]–[13], [26], [49]–[53]. As expected, c is sharply reduced near telomeres and centromeres, and we detect no CO events in the small fourth (dot) chromosome that proceeds to meiotic segregation without chiasmata [54]. 10.1371/journal.pgen.1002905.g001 Figure 1 Crossing over rate variation along chromosome arms in D. melanogaster. Rate of crossing over (c) based on data from all crosses and indicated in centimorgans (cM) per megabase (Mb) per female meiosis (red line). c is shown along chromosomes for 100-kb windows and a movement between adjacent windows of 50 kb. Blue lines indicate 90% confidence interval for c at each window. Our detailed maps deepen the recent appreciation for intra-chromosomal variation in CO rates in Drosophila [29], [55], [56] and outline this heterogeneity at a much finer scale across the whole genome. Heterogeneity in CO rates along each chromosome is significant at all physical scales analyzed, from 100 kb to 10 Mb, even after removing centromeric and telomeric regions with visibly reduced rates (P 40-fold) relative to either adjacent regions or to other crosses (Figure 2). As expected, crosses sharing one parental strain have more similar maps than crosses not sharing parental strains but the overall magnitude of the correlation between these crosses, albeit significant, is rather small (Spearman's R = +0.451). This observation reinforces the concept of a highly polygenic and polymorphic basis for CO distribution along chromosomes. 10.1371/journal.pgen.1002905.g002 Figure 2 Intra-specific variation in crossing over rates along chromosome arms. c (cM/Mb per female meiosis) for eight different crosses (different colors) and shown for adjacent 250-kb windows. To quantify variation in CO rates among the eight CO maps we estimated the variance to mean ratio (Index of Dispersion; R CO) and tested whether the different number of CO events at a given region can be explained by a Poisson process. Moreover, we focused on variation in the distribution of CO rates along chromosomes and therefore we took into account the number of total events for each chromosome (see Materials and Methods for details). Our study of R CO along chromosomes reveals many regions (107 or 22% of all non-overlapping 250-kb regions across the genome) with a variance among crosses larger than expected (overdispersion) and this pattern is observed in all chromosomes (Figure 3). The magnitude of this excess variance is highest for chromosome arm 2L while notably reduced for the chromosome arm 3L. Significant overdispersion of CO rates among crosses is also detected when we study larger genomic regions. At a physical scale of 1 Mb, more than half of the genomic regions exhibit excess variance, thus suggesting that regions with variable CO rates are frequent enough across the D. melanogaster genome to be playing a detectable role in a large fraction of these longer sequences. 10.1371/journal.pgen.1002905.g003 Figure 3 Estimate of the Index of Dispersion (R CO) for rates of crossing over along chromosome arms. R CO was obtained by comparing crossing over rates from eight crosses (see Materials and Methods for details) and is shown for adjacent 250-kb windows (blue line). The doted red line indicates the P = 0.0005 confidence threshold (equivalent to P ( = 0.05)/number of windows in whole-genome analyses). These results are consistent with early studies in Drosophila that reported natural variation in CO rates based on artificial selection experiments ([61] and references therein). Our genome-wide study details the genomic location and magnitude of this variation and depicts the first high-resolution polymorphic landscape of CO rates in D. melanogaster. Several genomic regions have low rates in all crosses, thus representing monomorphic (or high-frequency) coldspots for CO in D. melanogaster. Other regions assigned as peaks of CO rates based on combined maps, however, are strongly influenced by polymorphic hotspots at low frequency in our sample. In fact, most regions with excess variance in CO rates among crosses are associated with low-frequency hotspots rather than low-frequency coldspots suggesting that hotspots are transient (short-lived) features within D. melanogaster populations. Our results thus indicate that CO rates based on multiple crosses and genotypes are needed to obtain a representative depiction of a “species” recombination landscape. Additionally, the low frequency of the hotpots will strongly influence measures of recombination based on the arithmetic mean of all maps, suggesting higher rates than measures such the harmonic mean or median (see Figure S3 for a comparison between mean and median CO values). Notably, we observe genomic regions with very low (or zero) median CO rates while the sample mean would suggest average rates. Gene conversion maps in D. melanogaster We have detected a total of 74,453 GC events. Nevertheless, GC tracts that lay between adjacent markers are expected to be missed. Moreover, this underestimation is probably variable across the genome due to differences in SNP and marker density. Therefore, we expanded a maximum likelihood algorithm [62] that was proposed for estimating the length of GC tracts (LGC ) to simultaneously estimate LGC and the rate of GC initiation (γ), and be applicable to any region of arbitrary marker distribution and density (see Materials and Methods for details). Our genome-wide estimates of γ and average LGC are 1.25×10−7/bp/female meiosis and 518 bp, respectively. The study of each chromosome arm separately (Figure 4) shows that arms with evidence of CO (2L, 2R, 3L, 3R and X) have similar estimates of γ (1.13–1.49×10−7/bp/female meiosis) and LGC (456–632 bp). Notably, we observe several GC events in the small achiasmatic chromosome fourth where CO is completely absent. Our estimates of γ and LGC for the fourth chromosome are 0.46×10−7/bp/female meiosis and 1062 bp, respectively. 10.1371/journal.pgen.1002905.g004 Figure 4 Gene conversion estimates in D. melanogaster. Joint maximum-likelihood estimates (MLE) of the rate of gene conversion initiation (γ) and mean gene conversion tract length (L GC) in D. melanogaster. γ units are per bp and female meiosis, and L GC in bp. Red/yellow contours represent 95 confidence intervals for γ and L GC for each chromosome arm independently. The blue dot represents the genome average for γ and L GC based on a total of 74,453 observed GC events. The rosy locus in D. melanogaster is one of the best characterized in higher metazoa for intragenic recombination [63], [64]. These studies showed that GC events are more frequent than CO, with four non-crossover associated GC events to each CO [63]–[65]. In terms of absolute rate, the recovery of intragenic CO events at rosy reveals c∼3.0×10−8/bp/female meiosis [66] thus predicting γ∼1.2×10−7/bp/female meiosis at this locus. When we focus on the 100-kb genomic region encompassing the rosy locus our estimate of γ is 1.17×10−7/bp/female meiosis. At a whole-genome scale, our data suggest a γ (1.25×10−7/bp/female meiosis) and a ratio GC∶CO (∼83% of events result in GC) close to, albeit higher than, the estimates at rosy. A major difference between our results and those from the rosy locus however is the mean length of gene conversion tracts, with our average estimate of LGC (518 bp) significantly exceeding the estimate of 352 bp at rosy [62]. Another approach to estimate GC∶CO ratios is based on using an antibody to γ-His2Av as a molecular marker for DSB formation [67] and monitoring the number of γ-His2Av foci in DSB repair-defective mutants [68]. The number of estimated DSB in D. melanogaster using this methodology is up to 24.2 per genome [68], suggesting that 76.2% of all DSB are resolved as GC when we use the observed number of CO events per female meiosis from our data. The moderately higher fraction of GC observed in our study could be explained by differences among the strains used, if not all DSBs (or DSB-repair pathways) are marked by γ-His2Av staining [68] or if the DSB-repair defective mutants allowed for residual repair thus making some DSBs difficult to detect. Of particular interest will be future research focused on trying to localize experimentally DSBs on the fourth chromosome and other genomic regions where CO is absent but GC is detected. The analysis of the distribution of γ along chromosomes at the 100-kb scale reveals a more uniform distribution than that of CO (c) rates, with no reduction near telomeres or centromeres (Figure 5). More than 80% of 100-kb windows show γ within a 2-fold range, a percentage that contrasts with the distribution of CO where only 26.3% of 100-kb windows along chromosomes show c within a 2-fold range of the chromosome average. To test specifically whether the distribution of CO events is more variable across the genome that either GC or the combination of GC and CO events (i.e., number of DSBs), we estimated the coefficient of variation (CV) along chromosomes for each of the three parameters for different window sizes and chromosome arms. In all cases (window size and chromosome arm), the CV for CO is much greater (more than 2-fold) than that for either GC or DSBs (CO+GC), while the CV for DSBs is only marginally greater than that for GC: for 100-kb windows, the average CV per chromosome arm for CO, GC and DSBs is 0.90, 0.37 and 0.38, respectively. Nevertheless, we can also rule out the possibility that the distribution of GC events or DSBs are completely random, with significant heterogeneity along each chromosome (P 0.20) or introns (R = −0.041, P>0.16). An equivalent lack of association is observed when G+C nucleotide composition is compared to c (P>0.25 for both intergenic sequences and introns). We find therefore no evidence of gene conversion bias favoring G and C nucleotides in D. melanogaster based on nucleotide composition. The causes for some of the previous results that inferred gene conversion bias towards G and C nucleotides in Drosophila may be multiple and include the use of sparse CO maps as well as incomplete genome annotation. Because gene density in D. melanogaster is higher in regions with non-reduced CO [13], [79], the many recently annotated transcribed regions and G+C rich exons [31], [80], [81] may have been previously analyzed as neutral sequences, particularly in these genomic regions with non-reduced CO. The motifs of recombination in Drosophila To discover DNA motifs associated with recombination events (CO or GC), we focused on 1,909 CO and 3,701 GC events delimited by 500 bp or less (CO500 and GC500, respectively). Our D. melanogaster data reveal many motifs significantly enriched in sequences surrounding recombination events (18 and 10 motifs for CO and GC, respectively) (Figure 6 and Figure 7). Individually, the motifs surrounding CO events (MCO) are present in 6.8 to 43.2% of CO500 sequences, while motifs surrounding GC events (MGC) are present in 7.8 to 27.6% of GC500 sequences. Note that 97.7% of all CO500 sequences contain at least one MCO motif and 85.0% of GC500 sequences contain one or more MGC motif (Figure S4). 10.1371/journal.pgen.1002905.g006 Figure 6 Top DNA motifs found enriched in sequences encompassing CO events. We focused on 1,909 CO events delimited by 500 bp or less (CO500 sequences). Only motifs with E-vale 100 when c≤0.1 cM/Mb, consistent with population genetic estimates of γ/c at telomeric regions of the X chromosome of D. melanogaster [94]. 10.1371/journal.pgen.1002905.g008 Figure 8 Relationship between crossing over (c) and gene conversion (γ) rates across the D. melanogaster genome. c and γ based on 100-kb adjacent windows with windows grouped into 6 categories of equal number according to c [CO1, CO2, .., CO6 indicating increasing rates of c] The average c (cM/Mb/female meiosis) values for the six categories is: 0.078 (CO1), 0.727 (CO2), 1.439 (CO3), 2.294 (CO4), 3.299 (CO5) and 5.964 (CO6). Blue columns show results when whole chromosomes are analyzed. Orange columns show results after removing centromeric and telomeric regions with visibly reduced CO rates. There is a significant negative relationship between c and γ using independent (non-overlapping) 100-kb windows: Spearman's R = −0.1246 (P = 1.6×10−5) for whole chromosomes and R = −0.1191 (P = 1.2×10−4) after removing telomeric/centromeric regions. Several hypotheses can be put forward to explain the different landscapes for CO and GC across the D. melanogaster genome. Divergent DSB repair pathways have been proposed in Drosophila as in yeast [100], [101], with a synthesis-dependent strand annealing (SDSA) pathway that is associated only with GC events, while the resolution of the double Holliday junction (DHJ) can generate either CO or GC (Figure S1). The detection of GC events in the fourth chromosome strongly suggests the action of SDSA, at least for a chromosome completely lacking CO, and indicates that SDSA may be acting across the whole genome. The observation that γ increases when c is low even after removing the fourth chromosome and telomeric/centromeric regions (see Figure 8) argues against the option of very large chromosomal domains with DSBs that are repaired by a different pathway. This same negative relationship between γ and c, together with the observed overlap in motifs associated with CO and GC events (see above), suggests a shared origin, and thus indicates that the disparity of landscapes for GC and CO rates across the D. melanogaster genome may be influenced by a difference in the relative usage of DSB repair pathways (eg., DHJ versus SDSA) or by a variable bias in the repair decision when DHJ intermediates are resolved to form either CO or GC products. In yeast, the presence of mismatches interferes with the formation and/or extension of heteroduplex intermediates during mitotic and meiotic DSB repair [102], and sequence divergence inhibits mitotic COs to a greater extent than GCs [103]. A recent genome-wide analysis of meiotic recombination intermediates in yeast, however, suggests that mismatch repair increases the ratio CO∶GC [104]. In agreement, analyses at the rosy locus in D. melanogaster show a small increase in the ratio CO∶GC in the presence of sequence polymorphisms (27 CO and 5 GC) when compared to a case where polymorphisms are virtually absent (23 CO and 8 GC events) [66]. If this tendency is confirmed across the Drosophila genome and if the differences in mismatch presence across the genome are of sufficient magnitude to alter either DHJ/SDSA relative use or the resolution of DHJ into CO or GC, then genomic regions with reduced heterozygosity could favor a DSB repair favoring GC over CO events [105]. At a whole-genome level, nucleotide differentiation between parental strains (ranging between 0.005 and 0.007 for total pairwise differences per bp) shows no association with overall γ/c, c or γ (P>0.4 in all cases). To test the possible influence of mismatch presence across the genome we investigated the correlation between levels of total nucleotide polymorphism (π) and the γ/c ratio based on adjacent 100-kb regions (Figure 9; see Materials and Methods). Congruent with the hypothesis that the choice to repair DBS into either GC or CO is heterozygosity-dependent, we observe a strong negative correlation between total π and γ/c across the whole genome (R = −0.56, P 0.40). The differential distribution of GC and CO when looking at genic and intergenic sequences is consistent with the heterozygosity-dependent GC∶CO repair of DSB proposed above, given that intergenic sequences have higher levels of heterozygosity than genic sequences. Overall, our data suggest a higher probability of DSBs within annotated transcriptional units. 10.1371/journal.pgen.1002905.g010 Figure 10 Recombination events within genic sequences in the D. melanogaster genome. Analyses based on 1,909 and 3,701 CO and GC events delimited by 500 bp or less (CO500 and GC500). (A) Frequency of recombination events (CO or GC) within genic sequences. Probability [P (Freq. Observed 2.5 kb) shows the median GC500 position shifting significantly relative to the TSS (from +556 in short transcripts to +3588 in long transcripts; Mann-Whitney test U = 51,192, P 100-kb) and small- (i.e., single-gene) scales, among crosses and for GC and CO. Each of these levels is expected to play a role in shaping evolutionary patterns. We propose that these overlapping, non-independent and dynamic landscapes of recombination should be included in a new generation of population genetic models of selection and recombination. Materials and Methods Drosophila melanogaster strains We generated eight crosses using 12 highly inbred strains of D. melanogaster. Ten of these strains were recently generated by the laboratory of Trudy Mackay from a natural population collected in Raleigh (NC, USA) after 20 generations of full sib mating (RAL strains). Importantly, these strains have maintained substantial genetic/phenotypic variation between lines [141], [142]. The RAL strains used in this project are: RAL-208, -301, -306, -375, -380, - 391, -514, -712, -786 and -820. We also used two non-US strains, collected in Madagascar (MDG) and Papua New Guinea (PNG) obtained from the Bloomington Drosophila Stock Center (Indiana University). Table 1 indicates the eight crosses under study. Our stocks of RAL-786 and -820 carried fixed inversions for chromosome arm 3R and therefore the analyses of recombination rates for this chromosome arm excluded these two crosses. 10.1371/journal.pgen.1002905.t001 Table 1 Cross, parental strains, and number of female meioses. Cross # Parental Strain 1 Parental Strain 2 Number of female meioses 1 PNG MDG 762.5 2 RAL-208 RAL-375 1522 3 RAL-306 RAL-391 493.5 4 RAL-375 RAL-514 720 5 RAL-208 PNG 752.5 6 RAL-301 RAL-375 627 7 RAL-712 RAL-786 499.5 8 RAL-380 RAL-820 483 The genomes of the RAL strains have been sequenced [The Drosophila Population Genomics Project (DPGP [17]), and The Drosophila Genetic reference Panel (DGRP [142]). Nevertheless, and for all strains including RALs, we obtained Illumina sequence reads and generated genomic sequences of the strains used in our laboratory for crosses to get an accurate (current) description of SNPs and small indels for all parental strains, including the possible presence of heterozygous sites. Crosses and generation of Recombinant Advanced Intercross Lines (RAILs) A number of factors can influence recombination in D. melanogaster including female age, temperature, number of matings or food [57]–[60]. To reduce the effects of these factors, all crosses and strains were maintained at the same temperature (23C) under a 12 hr. light/dark cycle, with constant fly density in half pint bottles on standard corn-meal media. Recombination is higher for young (1–2 days) and old (>12 days) females [59]. Therefore all crosses were carried out with flies of equivalent age, with 6 hour old virgin flies allowed to mate and lay eggs for a period of 36–48 hours, after which the parents were discarded and the offspring were allowed to emerge. For all crosses, parental strains were crossed in both directions to average out possible maternal effects on recombination rates. Standard approaches to assess rates of crossing over often cross two inbred strains and genotype progeny from F2 backcrosses. To maximize the number of recombination events per fly genotyped, we generated recombinant advanced intercross lines (RAIL). To generate RAILs by sibling mating and minimize homozygosity, we crossed five groups of forty F1 flies, twenty males and twenty females, randomly collected from the previous generation and put together in new bottles. The F1 flies were also allowed to mate for 1.5–2 days after which they were discarded and F2 offspring were allowed to emerge. To generate the next generation, flies from all bottles were randomly mixed to create the next groups of twenty males and twenty females. For all crosses, the number of bottles was increased progressively, starting with five up to ten, to reduce the effect of random genetic drift that could generate homozygous combinations for recombination events thus underestimating recombination rates. Our crosses varied in terms of sib-mating generations, ranging between 1 (standard approach with a single meiosis in heterozygous D. melanogaster females) and 10 generations. To estimate the number of female meioses associated with each cross we modeled our crossing design with simulations mimicking the corresponding mating process, including the number of individuals (males and females) and bottles per generation (see Table 1 for the number of meioses corresponding to each cross). The use of the ‘equivalent meioses’ estimated with this approach generated a genetic map length for D. melanogaster of 287.3 cM. This value closely matches classical measures (282 cM) in this species [26] thus indicating that our approach to estimate the number of meiosis captures appropriately the map expansion with number of generations and the crossing design. Table 2 shows the genetic map length for each cross and chromosome arm. 10.1371/journal.pgen.1002905.t002 Table 2 Genetic map length per chromosome arm and cross. Cross # X 2L 2R 3L 3R 1 71.8 49.8 45.8 54.9 47.6 2 66.2 55.8 53.8 52.7 60.8 3 56.4 64.9 60.0 49.9 56.7 4 63.4 62.5 51.4 51.8 61.3 5 67.5 55.6 49.0 55.9 56.2 6 66.6 52.4 55.0 50.3 66.5 7 73.0 55.9 58.6 53.4 * 8 57.7 57.5 59.3 53.9 * * Fly stocks RAL-786 and -820 carry fixed inversions for chromosome arm 3R. Finally, female D. melanogaster RAILs were crossed to D. simulans males to obtain hybrid females for genotyping. We originally used D. simulans males from strain w501, but we finally decided to use D. simulans males from the strain Florida City due to the much higher fraction of hybrids produced. D. melanogaster×D. simulans crosses were conducted using D. simulans males aged for 3–4 days before combining them for interspecific mating. After mating, D. simulans males were discarded and D. melanogaster females were individualized and maintained in vials to lay eggs. For each cross between two parental strains, we froze (−20°C) the hybrid offspring of more than 350 mated D. melanogaster RAIL females. To ensure replicates and resequencing if needed, we froze 6 hybrid offspring females from each D. melanogaster mated females. Figure S5 shows a schematic representation of the crossing methodology. Note that hybrids are used for genotyping only and the meiotic products from these hybrids are not analyzed. In total, we sequenced 2,829 hybrid females (with 2,829 single-fly Illumina libraries, see below), corresponding to 5,860 meioses. DNA extraction We extracted DNA from single hybrid females using Tissue Lyser LT (Qiagen, Valencia, CA) followed by a modified Qiagen DNeasy Blood & Tissue Kit (Qiagen, Valencia, CA) protocol. 90% of the eluted DNA was used for preparing a library for each individual fly. The rest of the sample was frozen and kept for posterior validation of markers, crossing over (CO) or gene conversion (GC) events if needed. The DNA of each fly was fragmented using a Bioruptor UCD-200 (Diagenode, Denville, NJ) with 0.5 ml tubes and settings that maximize the concentration of sheared DNA at 300 bp. Most of the eluted fragmented DNA was used for library preparation while the remaining (15%) of the sample was used for analysis of fragmented DNA for each individual fly by measuring the relative amount of a segment of the gene rp49 using Real-time PCR analysis (Roche LightCycler 480; Roche, IN, USA). This first quantification of single fly DNA served to confirm successful DNA extraction to be used for Illumina libraries. A second relative quantification of DNA was also performed after ligation of Illumina adapters (see below) following an equivalent protocol and used to normalize the amount of library DNA from each fly for multiplex sequencing. Illumina library preparation Single fly Illumina libraries of sheared DNA were prepared using NEBNext reagents and protocols (New England Biolabs, MA, USA) scaled-down appropriately for small amounts of DNA (detailed protocol available from J.M.C. upon request) and using indexed Illumina adapters. When sequencing multiplexed samples (using different tags) there is the possibility of mistakenly assigning reads to the wrong sample due to sequencing errors in the tag sequence. The likelihood of miss-assignment rapidly increases with, 1) shorter tags, 2) with the use of sequence reads not perfectly matching tag sequences and, 3) when tag sequences are only one or two nucleotides away from other tag-sequences. To reduce this potential error as much as possible our custom 7-nucleotide tags (7 nt plus the required ‘T’ for Illumina primer ligation) were designed to be a minimum of 4 nucleotide changes away from any other tag. Additionally, we only used Illumina reads with initial sequences completely identical to the tags used. This conservative approach is needed to prevent read miss-assignment in marker-based, light sequencing studies. Our 24 custom-designed 7-nucleotides tags are described in Table S1. The use of 24 indexed Illumina adapters allowed us to multiplex 24 single-fly libraries in a single Illumina lane to be later separated computationally. The PCR enriched libraries were validated by running an aliquot on a standard agarose gel. Final concentration quantification was obtained with Quant-iT PicoGreen dsDNA Reagent and Kits (Invitrogen, CA, USA) on a Turner BioSystems TBS-380 Fluorometer. Twenty four equimolar PCR-enriched libraries were pooled together to obtain a final set of multiplexed single-fly libraries that would be sequenced in a single Illumina lane. Sequencing was mostly performed using the Illumina Genome Analyzer IIx and Illumina HiSeq 2000 instruments at the Iowa State University DNA Facility with additional plates sequenced at the High Throughput Sequencing Facility of the University of North Carolina at Chapel Hill using the Illumina Genome Analyzer IIx. Genomic analyses and bioinformatic pipeline Generation and annotation of parental reference sequences We generated reference genomic sequences for all D. melanogaster parental strains and for the D. simulans Florida City strain that we used for our crosses. We obtained 75- and 100-nt Illumina single reads in excess of 25-fold coverage for the D. melanogaster RAL strains and in excess of 40-fold coverage for D. melanogaster MDG and PNG and for D. simulans Florida City strains. To generate RAL reference parental sequences we added the sequencing reads from the NCBI short read archive (SRA) study SRP000694 [142] to our 75-nt or 100-nt reads. Filtering of reads, mapping and generation of consensus reference sequences for parental strains was carried out using the FASTX-toolkit (http://hannonlab.cshl.edu/fastx_toolkit/), BWA [143], SAMtools v1.4 [144] and a collection of custom PERL scripts using as reference the D. melanogaster genome sequence (r.5.30; http://flybase.org). Contrary to standard approaches to generating consensus sequences based on SNP calling, we generated parental reference sequences specifically intended for our mapping purposes. We focused on taking into account heterozygous sites in parental strains that could miss-assign the origin of individual reads as well as annotate as unreliable sites those sites with limited representation (coverage). Two distinct issues associated with heterozygosity within strains were detected. First, residual heterozygosity (present when the lines were originally sequenced, ca. 2008–2009) and maintained in the strain that was used in our laboratory for crosses. Second, sites showing a different high-frequency/monomorphic variant in our laboratory relative to when they were originally sequenced. We annotated (marked) each potential heterozygous site in the reference sequence of parental strains as ambiguous sites using the appropriate IUPAC ambiguity code using a permissive approach. We used full (raw) pileup files and conservatively considered as heterozygous site any site with a second (non-major) nucleotide at a frequency higher than 5% regardless of consensus and SNP quality. For instance, if the sequencing of a parental strain of D. melanogaster generates 12 reads exhibiting an ‘A’ and 1 read exhibiting a ‘G’ at a particular nucleotide position, the reference will be marked as ‘R’ even if consensus and SNP qualities are 60 and 0, respectively. We assigned ‘N’ to all nucleotide positions with coverage less that 7 regardless of consensus quality because of the lack of information on their heterozygous nature. We also assigned ‘N’ to positions with more than 2 nucleotides. This approach is conservative when used for marker assignment because the mapping protocol (see below) will remove heterozygous sites from the list of informative sites/markers while also introducing a “trapping” step for Illumina sequencing errors that may be not fully random. Finally we introduced insertions and deletions for each parental reference sequence based on raw pileup files. Mapping of reads and generation of D. melanogaster recombinant haplotypes Sequences were first pre-processed and only reads with sequences exact to one of tags were used for posterior filtering and mapping. FASTQ reads were quality filtered and 3′ trimmed, retaining reads with at least 80% percent of bases above quality score of 30, 3′ trimmed with minimum quality score of 12 and a minimum of 40 bases in length. Any read with one or more ‘N’ was also discarded. This conservative filtering approach removed an average of 22% of reads (between 15 and 35% for different lanes and Illumina platforms). We then eliminated all reads with possible D. simulans Florida City origin, either truly originating from the D. simulans chromosomes or with D. melanogaster origin but similar to a D. simulans sequence. We used MOSAIK assembler (http://bioinformatics.bc.edu/marthlab/Mosaik) to map reads to our marked D. simulans Florida City reference sequence. Contrary to other aligners, MOSAIK can take full advantage of the set of IUPAC ambiguity codes during alignment and for our purposes this allows the mapping and removal of reads when represent a sequence matching a minor allele within a strain. Moreover, MOSAIK was used to map reads to our marked D. simulans Florida City sequences allowing 4 nucleotide differences and gaps to remove D. simulans -like reads even with sequencing errors. We further eliminated D. simulans -like sequences by mapping remaining reads to all available D. simulans genomes and large contig sequences [Drosophila Population Genomics Project; DPGP, http://www.dpgp.org/] using the program BWA and allowing 3% mismatches. The additional D. simulans sequences were obtained from the DPGP website http://www.dpgp.org/ and included the genomes of six D. simulans strains [w501, C167, MD106, MD199, NC48 and sim4+6; [17]] as well as contigs not mapped to chromosomal locations. After removing reads potentially from D. simulans we wanted to obtain a set of reads that mapped to one parental strain and not to the other (informative reads). We first generated a set of reads that mapped to at least one of the parental reference sequences with zero mismatches and no indels. At this point we split the analyses into the different chromosome arms. To obtain informative reads for a chromosome we removed all reads that mapped to our marked sequences from any other chromosome arm in D. melanogaster, using MOSAIK to map to our marked reference sequences (the strain used in the cross as well as from any other sequenced parental strain) and using BWA to map to the D. melanogaster reference genome. We then obtained the set of reads that uniquely map to only one D. melanogaster parental strain with zero mismatches to the marked reference sequence of the chromosome arm under study in one parental strain but not in the other, and vice versa, using MOSAIK. Reads that could be miss-assigned due to residual heterozygosity or systematic Illumina errors would be removed in this step. Before considering the remaining reads as informative markers we incorporated additional filtering steps. We removed all reads that would differentially map to one parental reference sequence and not to the other if one of the reference sequences corresponding to this read contained one or more ‘N’s. Although the dominant error type in Illumina sequencing is substitutions (as opposed to indels), indels are more difficult to deal with and to incorporate uncertainty and therefore we removed reads that would differentially map to one parental reference sequence and not to the other if the parental reference sequences at this region differ by an indel. We also removed reads that showed nucleotide differences between parental strains at the 5′ or 3′-end of the read. Finally, we considered informative reads only those that distinguish between parental reference sequences with a maximum of 3 single-nucleotide, non-consecutive differences. The informative reads from each genotyped hybrid female were used to generate D. melanogaster (possibly recombinant) haplotypes for each chromosome arm. Detection of CO and GC events We mapped CO and GC events directly to each individual D. melanogaster haplotype (from a single RAIL hybrid) and not based on the combined analysis of all D. melanogaster haplotypes for a given chromosomal region. That is, most chromosomes only show 1 to 4 (in the case of several generations of RAIL) CO events. Due to the elevated density of markers each CO is supported by numerous (hundreds and often thousands) contiguous markers at either side and therefore we expect to have detected all COs. GC events on the other hand are supported by single or a few adjacent markers that do not extend over long stretches of DNA (i.e., much shorter than 25 kb). In principle, double CO in a single meiosis (or two independent COs in different meiosis) could be mistaken for long GC if they were very close to each other. GC events are assumed to be short, often shorter than 500 bp and be exceptionally rare above 10–15 kb [62]. We analyzed marker maps to locate CO and GC events along single chromosomes, using a cut-off for maximum tract length for GC (LGC ) of 15 kb. Several lines of evidence suggest that this approach classifies correctly CO and GC events based on our experimental design. First, equivalent maps for CO and GC were obtained when applying a cut-off of 25 kb, suggesting that at 15 kb we are classifying as GC most if not all detectable GC events and that, when two or more CO events occur in the same chromosome in our RAILs, these COs are separated by more than 25 kb. Second, crosses involving several generations of RAIL show equivalent number of CO per chromosome per female meiosis to crosses based on a single meiosis. Finally, simulations of CO distribution along chromosomes following the mating protocol used to generate RAILs, with a cut-off of 15 kb to assign GC and the conservative assumption of no CO interference shows a maximum erroneous assignment of 0.16% and 1.4% assuming random distribution or the observed distribution of CO, respectively. We detected a total of 32,511 CO events and CO maps for each cross and chromosome arm were generated by directly combining the observed COs from all individual haplotypes and tabulated along each chromosome in terms of c [centimorgans (cM) per megabase (Mb) per female meiosis]. Estimates of gene conversion initiation rates (γ) and GC tract length (LGC ) Our study revealed a total of 74,453 GC events. Nevertheless, a fraction of GC events are expected to be missed due to GC tracts that lay between adjacent markers. Moreover, this underestimation is predicted to be variable across the genome due to differences in SNP and marker density. Our data consists of a great many independent GC events distributed across different haplotypes for a given chromosome, each GC event likely defined by different SNPs and a different distance from adjacent SNPs. The nature of this dataset differs from previous population genetic studies of gene conversion [94], [145] as well as from experimental studies that based their results on genetic crosses that directly detected presence/absence of GC events using a limited number of informative markers and/or focused on a specific genomic region [62], [112]. SNPs not involved in GC events, each separated by a different distance from adjacent SNPs, are also informative about the rate of GC initiation (γ) and length of GC tracts (LGC ). We therefore expanded a previous maximum likelihood algorithm [62] to estimate simultaneously γ and LGC and to be applicable to any region of arbitrary size with variable density SNP/marker data that takes into account both observed GC events and markers not involved in GC events. Each observed, unselected, GC tract will be treated as a different event defined by the outmost markers (left and right nucleotides) of the observed tract that describe the minimum true tract length (L min; L min≥1). We also know that a tract has a left end and a right end delimited by the nearest left/right flanking markers not involved in the tract, with mgc indicating the average number of nucleotides between the observed tract and the left and right flanking markers. The maximum tract length (L max) is then L min+2(mgc ). Following Hilliker et al. (1994) [62], gene conversion tract lengths can be described by a geometric distribution that assumes independence of each nucleotide-adding step with a probability φ. The probability of a GC tract of length n nucleotides can be described by with the mean tract length The likelihood of an observed GC event that encompasses the observed tract is then Markers not involved in GC tracts either due to no GC event or because GC tracts initiate and terminate between two 2 markers are also informative. These markers are separated by m nucleotides and we preserve the possibility that m differs from mgc . Let 1- φn denote the probability of a GC tract shorter than n nucleotides. Then For a complete dataset with k GC events and t markers not being involved in GC events, the total Likelihood of the data is or its log for convenience. Finally we can obtain numerically the Maximum Likelihood Estimate (MLE) of γ and LGC using the log-likelihood function for our dataset(s). We have applied this approach to estimate γ and length LGC for the whole genome as well as for each and along chromosome arms. Validation In silico False Discovery Rate (FDR) analysis Although we have strived for designing a protocol that includes a hefty number of filters and mapping controls, we anticipate a non-zero rate of misplacing reads given the massive number of reads obtained per cross. We estimated our false discovery rate (FDR) for CO and GC events by generating random collections of Illumina reads when there is no expectation of detecting any recombination (CO or GC) event. We applied the same bioinformatic pipeline used to identify informative markers, generate D. melanogaster haplotypes and ultimately identify CO and GC events and estimate c and γ. We investigated the efficacy of our filtering/mapping protocol by generating collections of reads with 50% of reads from a single parental D. melanogaster (eg, RAL-208) and 50% of reads from the D. simulans strain used in all crosses (Florida City) to closely represent the reads from a single hybrid female fly when there is no expectation for any CO or GC event. The reads used for this study were obtained from our Illumina sequencing effort of parental D. melanogaster and the D. simulans strains used in this study (see above) and were used with no a priori knowledge of their sequence and mapping quality, Each in silico library is, on average, equivalent to individual hybrid libraries in terms of number of reads with the only difference that we removed the first 8 nucleotides of each read from the parental lines (equivalent to the removal of the 5′ (7 nt+‘T’) tag in our multiplexed hybrid reads). This approach to estimate FDR takes into account possible limitations in the filtering and mapping algorithms and protocols, Illumina sequencing errors (random and non-random), the effects of non-complete or inaccurate reference sequences and the bioinformatic pipeline. We generated 400 in silico random library collections (the average number of libraries per cross), applied the same bioinformatic pipeline and parameters used for the filtering and mapping of reads from our crosses and estimated CO and GC rates. Because the expectation is zero for both CO and GC we can compare these rates to those from actual crosses to obtain a suitable FDR. Our results show that no CO event would be inferred when using only one D. melanogaster parental strain and D.simulans (zero events in all 400 in silico libraries compared to the more than 2,000 detected per cross). GC events are however detected. Overall, we can infer that 4.1% of our inferred GC events can be explained by miss-assigned reads and that most of these erroneously mapped reads are from the D. melanogaster strain, not from the parental D.simulans. This FDR varies among chromosomes, highest and lowest for the 3R (6.2%) and X (1.9%) chromosome arms, respectively. Zero GC events (in 400 in silico libraries) were inferred in the small chromosome 4. Worthy of note, preliminary FDR studies based on more straightforward filtering/mapping protocols revealed higher percentages of miss-assigned reads that would overestimate the number of GC events. These results guided us to include additional steps to our bioinformatic pipeline. The additional steps included in our final protocols (see above) consisted in, 1) increased filtering stringency of reads before mapping, 2) use of additional D. simulans sequences and contigs (additional to the Florida City reference) for ‘trapping’ purposes of D. simulans -like sequences, 3) removal of reads mapping to chromosomes other than the one under study (with permissive mapping allowing up to 10% mismatch), and 4) removal of reads that differentiate parental sequences if these sequences differ by an indel. Experimental PCR/sequencing of gene conversion events When we extracted DNA from single hybrid females we froze ∼10% of the total DNA for posterior validation of markers, crossing over (CO) or gene conversion (GC) events if needed. We designed primers, PCR amplified and sequenced the genomic regions surrounding 31 GC events detected based on our bioinformatic pipeline. These 31 GC events were chosen to represent GC events assessed based on a single informative marker, in the fourth chromosome or with long GC tract (>2 kb). 30 out of these 31 showed SNPs patterns confirming the presence of a GC event. Five of these confirmed GC events were located in the fourth chromosome. Overall the GC events ranged in tract length, from 171 bp to a maximum of 9,585, the latter located in the X chromosome (ca. position 2.6 Mb). Estimates of recombination heterogeneity along chromosomes Heterogeneity in recombination events (CO or GC) along chromosome arms was investigated as the probability of observing equal or greater coefficient of variation (CV) among non-overlapping regions. The size of the genomic regions ranged from 100-kb up to half the chromosome length (10 Mb) in 100-kb increments. Probabilities were obtained based on 10,000 replicates per window size and chromosome, and assuming a random distribution of events. Heterogeneity was investigated for whole chromosomes and after removing centromeric and telomeric regions with visibly reduced CO rates (see below). Estimates of variation in recombination maps among crosses We studied intra-specific variation in CO rates by estimating the variance to mean ratio, or Index of Dispersion (R), and tested whether the observed variance in CO events among crosses can be explained by a Poisson process. This approach assumes that for a given chromosomal region the number of observed CO events in different crosses is a set of independent and identically distributed random variables. In the case of Poisson distributed data, the expectation is R∼1. To focus our study on variation in the distribution of CO rates and take into account genome-wide differences between crosses and the different number of meiosis analyzed, we used weighting factors (w) that include the total number of CO per chromosome in a particular cross as ‘lineage effects’. The weighting factor of cross i for a given chromosome (wi ) is then with mi representing the total number of CO events in a chromosome in cross i, and n the number of crosses analyzed (in our case n = 8). Following Gillespie (1989) [146], we can obtain the index of dispersion for CO events (R CO) for region r based on with where , and indicate the mean number of recombination events at region r, the number of CO events in region r for cross i, and the unbiased estimator of the variance at region r, respectively. Assuming a Poisson distribution of CO events, R CO values significantly greater than 1 would indicate an excess variance (overdispersion) of CO events among crosses for the region under study. We estimated the statistical significance of R CO values being greater than expected using simulations that assume a Poisson-distributed total number of CO events that are randomly distributed among crosses (once corrected for lineage effects) and along chromosomes. Note that the variance due to binomial sampling is greater than that for Poisson, thus the expectation for binomial distributed data is R<1, making our approach conservative when assessing overdispersion of CO events among crosses. Probabilities were obtained based on 100,000 independent replicates. R CO along chromosomes was investigated for non-overlapping regions with sizes ranging from 250 kb to 1 Mb. Local distribution of CO and GC events Because we focused on CO and GC events delimited by 500 bp or less (CO500 and GC500), heterogeneity in the density of informative markers across the genome could bias our expected, null, distribution of CO500 and GC500 events. Indeed, informative markers are 1.4% more likely to be located in intergenic regions than within genic regions. We then used this “marker-density corrected” approach to obtain the expected frequency of CO500 and GC500 recombination events within genic sequences (0.607 instead of 0.637 obtained when assuming random distribution of markers). DNA motifs We used MEME (version 4.6.1) [147], a software for discovering motifs in sets of DNA sequences, to search for motifs enriched in sequences around CO and GC events. We focused on 1,909 CO and 3,701 GC events delimited by 500 bp or less (CO500 and GC500). When the length of the sequence delimited by adjacent markers was less than 500 bp we extended the sequence both at the 5′ and 3′ ends up to a total of 500 bp. We searched for motifs with nucleotide size ranging between 5 and 20 nucleotides and used as a background sequence model one that takes into account both nucleotide and dinucleotide composition (first-order Markov model), the latter important to capture the observed frequency of dinucleotide repeats across the genome. MEME was applied to search for motifs on both the given DNA strand and the reverse complement strand and allowing any number of repetitions per sequence using the complete 500 bp sequences. We also used MEME to generate sequence LOGOS for each discovered motif. Figure 6 and Figure 7 show the 18 and 10 motifs enriched in CO500 and GC500, respectively, with corrected probabilities (E-values) smaller than 10−10. E-values represent the expected number of sequences in a random database of the same size that would match the motifs as well as the sequence does. To obtain comparable E-values for CO and GC events we applied MEME to a random set of 1,000 CO500 and 1,000 GC500 sequences. We also investigated the commonness of the motifs with E-value<10−10 within CO500 and GC500 sequences (Figure S4). Centromeric and telomeric regions with reduced CO rates For the sake of analyzing the distribution of recombination rates along chromosomes and their possible causes or consequences we designated centromeric and telomeric regions with visibly reduced CO (see Figure 1) as possibly influencing some of our analyses (eg, heterogeneity along chromosomes, CO vs GC rates, local distribution of CO and GC events, etc.). Thus most analyses were performed using whole chromosomes as well as after removing the fourth chromosome (which has no detectable CO) and these long, broadly defined peripheral regions (see Table 3). 10.1371/journal.pgen.1002905.t003 Table 3 Genomic regions investigated after removing centromeric and telomeric regions. Chromosome arm Genomic region* X 2.3 — 20.8 Mb 2L 0.5 — 17.4 Mb 2R 5.2 — 20.8 Mb 3L 0.7 — 19.9 Mb 3R 9.7 — 26.9 Mb * Genomic location based on the nucleotide position in the D. melanogaster reference (r.5.3) genome sequence. Nucleotide composition across the D. melanogaster genome We investigated the possible relationship between recombination rates (c or γ) and nucleotide composition across the D. melanogaster genome to test the hypothesis of a bias in heteroduplex repair with AT/GC heterozygotes being preferentially repaired towards G and C nucleotides (aka, gene conversion bias). Our analyses of nucleotide composition in the D. melanogaster genome were based on the D. melanogaster reference sequence (release 5.3, January 2011). We annotated all gene models (not only protein coding genes), transposable elements and repetitive sequences onto the reference sequence and marked any overlapping annotation. To test for gene conversion bias in D. melanogaster, we focused on G+C nucleotide composition at intergenic and intronic sites using adjacent 100-kb regions. Intergenic sites were defined as those sites between any annotated gene model (thus also excluding annotated UTRs), transposable elements or repetitive sequences. Intronic sites were defined as those that are only associated with an intron definition in protein coding genes (i.e., sites that are either ‘intron’ or ‘exon’ depending on alternatively spliced forms were not considered) and do not overlap with any other annotation. Nucleotide polymorphism across the D. melanogaster genome For the sake of obtaining information on polymorphism levels across the genome we used DGRP sequenced strains [142], all from the same natural population (Raleigh, NC, USA) than the RAL strains used in this study, and added our sequencing data from the parental RAL strains. We also included the sequenced strains MDG and PNG in the overall analysis of polymorphism. Among all DGRP strains, we chose the 34 strains with 454-Roche as well as Illumina sequencing reads to maximize SNP accuracy. DGRP sequencing reads were obtained from NCBI short read archive (SRA) study SRP000694. For each strain, we obtained reference sequences using the same mapping procedure described above now using BWA-SW [148] to map additional 454 reads. We called SNPs relative to the D. melanogaster genome reference when SNP quality was greater than Q40 and generated consensus sequences without ‘marking’ heterozygous sites. We estimated nucleotide polymorphism as the pairwise nucleotide variation per site (π), with X-linked values adjusted to generate comparable estimates to autosomal regions. To test the hypothesis of a heterozygosity-dependent GC∶CO repair bias imposed after DSB formation in D. melanogaster, we investigated the relationship between polymorphism and the ratio γ/c or γ across the genome using total π. To study the relationship between polymorphism and c we focused on polymorphism at noncoding sites (intergenic and introns). Analyses comparing estimates of π with either γ/c, γ or c across the genome were based on values from 100-kb adjacent windows. Data availability Estimates of recombination reported in this study are publicly available, for any genomic region or gene in D. melanogaster, from www.recombination.biology.uiowa.edu and www.recombinome.com. Supporting Information Figure S1 Double-strand repair (DSB) pathway and recombination during meiosis. The DSB repair via double Holliday junction can generate either crossover (CO) or non-crossover (gene conversion, GC) events while the Holliday junction-independent repair (synthesis-dependent strand annealing, SDSA) mechanism causes only GC events. (TIF) Click here for additional data file. Figure S2 Crossing over rates (c) along chromosome arms based on nth polynomial equations relating the physical position to c. A single polynomial has been applied per chromosome arm, with the physical position (x) in Mb and c (y) in cM/Mb. The equations best fitting adjacent 100-kb windows are the following: (TIF) Click here for additional data file. Figure S3 Relationship between sample mean and sample median for rates of crossing over (c). (A) Sample mean and sample median for c across the D. melanogaster genome estimated for adjacent 250-kb windows and shown separately for each chromosome arm. c indicates cM/Mb per female meiosis. (B) Blowup of the relationship between c based on sample mean and c based on sample median when c (sample mean) is smaller than 3 cM/Mb. (TIF) Click here for additional data file. Figure S4 Cumulative percentage of sequences with CO and GC motifs. Percentage of sequences with one or more of the 18 CO and 10 GC motifs found to be overrepresented in CO500 and GC500 sequences, respectively (see Figure 7 and Figure 8 for details). (TIF) Click here for additional data file. Figure S5 Schematic representation of the crossing methodology used in this study to generate recombinant chromosomes between D. melanogaster parental strains. (TIF) Click here for additional data file. Table S1 24 custom-designed 7-nucleotides tags. (DOC) Click here for additional data file.
              Bookmark
              • Record: found
              • Abstract: found
              • Article: not found

              Regulation of gene expression via the core promoter and the basal transcriptional machinery.

              The RNA polymerase II core promoter is a structurally and functionally diverse transcriptional regulatory element. There are two main strategies for transcription initiation - focused and dispersed initiation. In focused initiation, transcription starts from a single nucleotide or within a cluster of several nucleotides, whereas in dispersed initiation, there are several weak transcription start sites over a broad region of about 50 to 100 nucleotides. Focused initiation is the predominant means of transcription in simpler organisms, whereas dispersed initiation is observed in approximately two-thirds of vertebrate genes. Regulated genes tend to have focused promoters, and constitutive genes typically have dispersed promoters. Hence, in vertebrates, focused promoters are used in a small but biologically important fraction of genes. The properties of focused core promoters are dependent upon the presence or absence of sequence motifs such as the TATA box and DPE. For example, Caudal, a key regulator of the homeotic gene network, preferentially activates transcription from DPE- versus TATA-dependent promoters. The basal transcription factors, which act in conjunction with the core promoter, are another important component in the regulation of gene expression. For instance, upon differentiation of myoblasts to myotubes, the cells undergo a switch from a TFIID-based transcription system to a TRF3-TAF3-based system. These findings suggest that the core promoter and basal transcription factors are important yet mostly unexplored components in the regulation of gene expression. Copyright 2009 Elsevier Inc. All rights reserved.
                Bookmark

                Author and article information

                Contributors
                miriam.merenciano@ibe.upf-csic.es
                camillo.iacometti@edu.unito.it
                josefa.gonzalez@ibe.upf-csic.es
                Journal
                Mob DNA
                Mob DNA
                Mobile DNA
                BioMed Central (London )
                1759-8753
                13 March 2019
                13 March 2019
                2019
                : 10
                : 10
                Affiliations
                [1 ]ISNI 0000 0001 2172 2676, GRID grid.5612.0, Institute of Evolutionary Biology (CSIC-Universitat Pompeu Fabra), ; Passeig Maritim de la Barceloneta 37,49, 08003 Barcelona, Spain
                [2 ]ISNI 0000 0001 2336 6580, GRID grid.7605.4, Department of Life Sciences and Systems Biology, , University of Turin, ; Turin, Italy
                Author information
                http://orcid.org/0000-0001-9824-027X
                Article
                152
                10.1186/s13100-019-0152-9
                6415491
                30911338
                d7f228c9-2170-4904-b276-257bf10b667b
                © The Author(s). 2019

                Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

                History
                : 21 December 2018
                : 28 February 2019
                Funding
                Funded by: FundRef http://dx.doi.org/10.13039/100010663, H2020 European Research Council;
                Award ID: H2020-ERC-2014-CoG-647900
                Categories
                Short Report
                Custom metadata
                © The Author(s) 2019

                Genetics
                transposable element,fecundity,viability,target site duplication,recurrent insertion,natural population

                Comments

                Comment on this article