
      The complete sequence of the smallest known nuclear genome from the microsporidian Encephalitozoon intestinalis

      research-article


          Abstract

          The genome of the microsporidian Encephalitozoon cuniculi is widely recognized as a model for extreme reduction and compaction. At only 2.9 Mbp, the genome encodes approximately 2,000 densely packed genes and little else. However, the nuclear genome of its sister, Encephalitozoon intestinalis, is even more reduced; at 2.3 Mbp, it represents a 20% reduction from an already severely compacted genome, raising the question, what else can be lost? In this paper, we describe the complete sequence of the E. intestinalis genome and its comparison with that of E. cuniculi. The two species share a conserved gene content, order and density over most of their genomes. The exceptions are the subtelomeric regions, where E. intestinalis chromosomes are missing large gene-containing blocks of sequence found in E. cuniculi. In the remaining gene-dense chromosome 'cores', the diminutive intergenic sequences and introns are actually more highly conserved than the genes themselves, suggesting that they have reached the limits of reduction for a fully functional genome.

          Abstract

          A comparison of related genomes provides valuable information about how they evolve. Here, the complete sequence of the smallest known nuclear genome, from the microsporidian E. intestinalis, is described and compared with that of its larger sister E. cuniculi, revealing what parts are indispensable in even the most reduced genomes.


          Most cited references (27)


          Artemis and ACT: viewing, annotating and comparing sequences stored in a relational database

          Motivation: Artemis and Artemis Comparison Tool (ACT) have become mainstream tools for viewing and annotating sequence data, particularly for microbial genomes. Since its first release, Artemis has been continuously developed and supported with additional functionality for editing and analysing sequences based on feedback from an active user community of laboratory biologists and professional annotators. Nevertheless, its utility has been somewhat restricted by its limitation to reading and writing from flat files. Therefore, a new version of Artemis has been developed, which reads from and writes to a relational database schema, and allows users to annotate more complex, often large and fragmented, genome sequences. Results: Artemis and ACT have now been extended to read and write directly to the Generic Model Organism Database (GMOD, http://www.gmod.org) Chado relational database schema. In addition, a Gene Builder tool has been developed to provide structured forms and tables to edit coordinates of gene models and edit functional annotation, based on standard ontologies, controlled vocabularies and free text. Availability: Artemis and ACT are freely available (under a GPL licence) for download (for MacOSX, UNIX and Windows) at the Wellcome Trust Sanger Institute web sites: http://www.sanger.ac.uk/Software/Artemis/ http://www.sanger.ac.uk/Software/ACT/ Contact: artemis@sanger.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.

            Population Genomics: Whole-Genome Analysis of Polymorphism and Divergence in Drosophila simulans

            Introduction

            Given the long history of Drosophila as a central model system in evolutionary genetics beginning with the origins of empirical population genetics in the 1930s, it is unsurprising that Drosophila data have inspired the development of methods to test population genetic theories using DNA variation within and between closely related species [1–4]. These methods rest on the supposition of the neutral theory of molecular evolution that polymorphism and divergence are manifestations of mutation and genetic drift of neutral variants at different time scales [5]. Under neutrality, polymorphism is a “snapshot” of variation, some of which ultimately contributes to species divergence as a result of fixation by genetic drift. Natural selection, however, may cause functionally important variants to rapidly increase or decrease in frequency, resulting in patterns of polymorphism and divergence that deviate from neutral expectations [1,2,6]. A powerful aspect of inferring evolutionary mechanism in this population genetic context is that selection on sequence variants with minuscule fitness effects, which would be difficult or impossible to study in nature or in the laboratory but are evolutionarily important, may cause detectable deviations from neutral predictions. Another notable aspect of these population genetic approaches is that they facilitate inferences about recent selection—which may be manifest as reduced polymorphism or elevated linkage disequilibrium—or about selection that has occurred in the distant past—which may be manifest as unexpectedly high levels of divergence. The application of these conceptual advances to the study of variation in closely related species has resulted in several fundamental advances in our understanding of the relative contributions of mutation, genetic drift, recombination, and natural selection to sequence variation. 
However, it is also clear that our genomic understanding of population genetics has been hobbled by fragmentary and nonrandom population genetic sampling of genomes. Thus, the full value of genome annotation has not yet been applied to the study of population genetic mechanisms. Combining whole-genome studies of genetic variation within and between closely related species (i.e., population genomics) with high-quality genome annotation offers several major advantages. For example, we have known for more than a decade that regions of the genome experiencing reduced crossing over in Drosophila tend to show reduced levels of polymorphism yet normal levels of divergence between species [7–10]. This pattern can only result from natural selection reducing levels of polymorphism at linked neutral sites, because it violates the neutral theory prediction of a strong positive correlation between polymorphism and divergence [5]. However, we have no general genomic description of the physical scale of variation in polymorphism and divergence in Drosophila and how such variation might be related to variation in mutation rates, recombination rates, gene density, natural selection, or other factors. Similarly, although several Drosophila genes have been targets of molecular population genetic analysis, in many cases, these genes were not randomly chosen but were targeted because of their putative association with phenotypes thought to have a history of adaptive evolution [11,12]. Such biased data make it difficult to estimate the proportion of proteins diverging under adaptive evolution. In a similar vein, the unique power of molecular population genetic analysis, when used in concert with genome annotation, could fundamentally alter our notions about phenotypic divergence due to natural selection. This is because our current understanding of phenotypic divergence and its causes is based on a small and necessarily highly biased description of phenotypic variation. 
Alternatively, a comprehensive genomic investigation of adaptive divergence could use genome annotations to reveal large numbers of new biological processes previously unsuspected of having diverged under selection. Here we present a population genomic analysis of D. simulans. D. simulans and D. melanogaster are closely related and split from the outgroup species, D. yakuba, several million years ago [13–15]. The vast majority of D. simulans and D. yakuba euchromatic DNA is readily aligned to D. melanogaster, which permits direct use of D. melanogaster annotation for investigation of polymorphism and divergence and allows reliable inference of D. simulans–D. melanogaster ancestral states over much of the genome. Our analysis uses a draft version of a D. yakuba genome assembly (aligned to the D. melanogaster reference sequence) and a set of light-coverage, whole-genome shotgun data from multiple inbred lines of D. simulans, which were syntenically aligned to the D. melanogaster reference sequence. Results/Discussion Genomes and Assemblies Seven lines of D. simulans and one line of D. yakuba were sequenced at the Washington University Genome Sequencing Center (the white paper can be found at http://www.genome.gov/11008080). The D. simulans lines were selected to capture variation in populations from putatively ancestral geographic regions [16], recent cosmopolitan populations, and strains encompassing the three highly diverged mitochondrial haplotypes previously described for the species [17]. These strains have been deposited at the Tucson Drosophila Stock Center (http://stockcenter.arl.arizona.edu). A total of 2,424,141 D. simulans traces and 2,245,197 D. yakuba traces from this project have been deposited in the National Center for Biotechnology Information (NCBI) trace archive. D. simulans syntenic assemblies were created by aligning trimmed, uniquely mapped sequence traces from each D. simulans strain to the euchromatic D. melanogaster reference sequence (v4). 
Two strains from the same population, sim4 and sim6, were unintentionally mixed prior to library construction; reads from these strains were combined to generate a single, deeper, syntenic assembly (see Materials and Methods), which is referred to as SIM4/6. The other strains investigated are referred to as C167.4, MD106TS, MD199S, NC48S, and w501 . Thus, six (rather than seven) D. simulans syntenic assemblies are the objects of analysis. Details on the fly strains and procedures used to create these assemblies, including the use of sequence quality scores, can be found in Materials and Methods. The coverages (in Mbp) for C167.4, MD106TS, MD199S, NC48S, SIM4/6, and w501 , are 56.9, 56.3, 63.4, 42.6, 89.8, and 84.8, respectively. A D. yakuba strain Tai18E2 whole-genome shotgun assembly (v2.0; http://genome.wustl.edu/) generated by the Parallel Contig Assembly Program (PCAP) [18] was aligned to the D. melanogaster reference sequence (Materials and Methods). The main use of the D. yakuba assembly was to infer states of the D. simulans–D. melanogaster ancestor. For many analyses, we used divergence estimates for the D. simulans lineage or the D. melanogaster lineage (from the inferred D. simulans–D. melanogaster ancestor) rather than the pairwise (i.e., unpolarized) divergence between these species. These lineage-specific estimates are often referred to as “D. simulans divergence,” “D. melanogaster divergence,” or “polarized divergence.” A total of 393,951,345 D. simulans base pairs and 102,574,197 D. yakuba base pairs were syntenically aligned to the D. melanogaster reference sequence. Several tens of kilobases of repeat-rich sequences near the telomeres and centromeres of each chromosome arm were excluded from our analyses (Materials and Methods). D. simulans genes were conservatively filtered for analysis based on conserved physical organization and reading frame with respect to the D. melanogaster reference sequence gene models (Materials and Methods). 
We took this conservative approach so as to retain only the highest quality D. simulans data for most inferences. The number of D. simulans genes remaining after filtering was 11,466. Ninety-eight percent of coding sequence (CDS) nucleotides from this gene set are covered by at least one D. simulans allele. The average number of lines sequenced per aligned D. simulans base was 3.90. For several analyses in which heterozygosity and divergence per site were estimated, we further filtered the data so as to retain only genes or functional elements (e.g., untranslated regions [UTRs]) for which the total number of bases sequenced across all lines exceeded an arbitrary threshold (see Materials and Methods). The numbers of genes for which we estimated coding region expected heterozygosity, unpolarized divergence, and polarized divergence were 11,403, 11,439, and 10,150, respectively. Coverage on the X chromosome was slightly lower than autosomal coverage, which is consistent with less X chromosome DNA than autosomal DNA in mixed-sex DNA preps. Variable coverage required analysis of individual coverage classes (n = 1–6) for a given region or feature, followed by estimation and inference weighted by coverage (Materials and Methods). The D. simulans syntenic alignments are available at http://www.dpgp.org/. An alternative D. simulans “mosaic” assembly, which is available at http://www.genome.wustl.edu/, was created independently of the D. melanogaster reference sequence. General Patterns of Polymorphism and Divergence Nucleotide variation. We observed 2,965,987 polymorphic nucleotides, of which 43,878 altered the amino acid sequence; 77% of sampled D. simulans genes were segregating at least one amino acid polymorphism. The average, expected nucleotide heterozygosity (hereafter, “heterozygosity” or “πnt”) for the X chromosome and autosomes was 0.0135 and 0.0180, respectively. 
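As a rough check on how the two heterozygosity values just quoted compare, X-linked heterozygosity can be rescaled by 4/3. This is the usual adjustment when the X chromosome is assumed to have three-quarters the effective population size of an autosome (equal numbers of breeding males and females); the symbols π_X, π_A, and the "adj" superscript are notation introduced here for illustration, not the authors' own.

```latex
\pi_{X}^{\mathrm{adj}} \;=\; \tfrac{4}{3}\,\pi_{X} \;=\; \tfrac{4}{3}\times 0.0135 \;=\; 0.0180,
\qquad \pi_{A} \;=\; 0.0180
```

On this scale the X-linked and autosomal point estimates coincide to the precision reported.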
X chromosome πnt was not significantly different from that of the autosomes (after multiplying X chromosome πnt by 4/3, to correct for X/autosome effective population size differences when there are equal numbers of males and females; see [19]). However, X chromosome divergence was greater than autosomal divergence in all three lineages (50-kb windows; Table 1, Table S1, Figure 1, Dataset S8). We will discuss this pattern in greater detail below.

Table 1 Autosome and X Chromosome Weighted Averages of Nucleotide Heterozygosity (π) and Lineage Divergence

Figure 1 Patterns of Polymorphism and Divergence of Nucleotides along Chromosome Arms. Nucleotide π (blue) and div on the D. simulans lineage (red) in 150-kbp windows are plotted every 10 kbp. χ[–log(p)] (olive) as a measure of deviation (+ or –) in the proportion of polymorphic sites in 30-kbp windows is plotted every 10 kbp (see Materials and Methods). C and T correspond to locations of centromeres and telomeres, respectively. Chromosome arm 3R coordinates correspond to D. simulans locations after accounting for fixed inversion on the D. melanogaster lineage.

Not surprisingly, many patterns of molecular evolution identified from previously published datasets were confirmed in this genomic analysis. For example, synonymous sites and nonsynonymous sites were the fastest and slowest evolving site types, respectively [20–24]. Nonsynonymous divergence (dN) and synonymous divergence (dS) were positively, though weakly, correlated (r² = 0.052, p […] 6 for each of polymorphisms, fixations, synonymous variants, and nonsynonymous variants (Dataset S1). The filtered data set of unpolarized MK tests contained 6,702 genes, of which 1,270 (19%) were significant (in the direction of adaptive evolution) at the 0.05 critical value and 539 (8%) genes were significant at a 0.01 critical value. 
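The MK tests counted above each reduce to a 2 × 2 table of nonsynonymous versus synonymous variants, split into polymorphic and fixed, evaluated with Fisher's Exact test. The following is a minimal sketch of that calculation, not the authors' pipeline: the function names, the dictionary keys, and the example counts are hypothetical, the margin filter of six follows the threshold described in the Materials and Methods, and the neutrality index and α are computed with the standard formulas NI = (Pn/Ps)/(Dn/Ds) and α = 1 − NI.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum every table that is no more probable than the observed one.
    total = sum(p for p in map(prob, range(lowest, highest + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def mk_test(pn, ps, dn, ds, min_margin=6):
    """McDonald-Kreitman style test (illustrative helper, names are assumptions).

    pn, ps: nonsynonymous / synonymous polymorphisms
    dn, ds: nonsynonymous / synonymous fixed differences
    Returns None when any row or column total is below min_margin.
    """
    margins = (pn + ps, dn + ds, pn + dn, ps + ds)
    if min(margins) < min_margin:
        return None  # too few observations; the gene would be filtered out
    p_value = fisher_exact_two_sided(pn, ps, dn, ds)
    # Neutrality index < 1 indicates an excess of nonsynonymous fixations.
    ni = (pn / ps) / (dn / ds) if ps and dn and ds else None
    alpha = 1 - ni if ni is not None else None
    return {"p": p_value, "NI": ni, "alpha": alpha}


if __name__ == "__main__":
    # Illustrative counts only; they are not taken from the D. simulans data.
    print(mk_test(pn=2, ps=42, dn=7, ds=17))
```

A gene would be called significant "in the direction of adaptive evolution" only when the p-value is below the chosen critical value and the neutrality index is below one; the exact tail is computed here with the standard library so that the sketch has no external dependencies.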
Given that MK tests can only detect directional selection when multiple beneficial mutations have fixed, these results provide a conservative view of the prevalence of adaptive protein divergence. There was a slight enrichment of significant unpolarized MK tests on the autosomes relative to the X chromosome (Fisher's Exact test, p = 0.0014). However, conclusions regarding the incidence of directional selection on autosomes versus the X chromosome should be tempered by the fact that the average numbers of polymorphic and fixed variants per gene may differ between the two types of chromosomes, which affects the power of the MK test to reject neutrality. We observed no enrichment of significant tests in regions of the X chromosome hypothesized to experience greater versus lower rates of crossing over. Several of the most highly significant MK test statistics are from genes with known functions and in many cases, known names and mutant phenotypes. More generally, among the genes with no associated GO term, a smaller proportion had significant unpolarized MK tests compared to the proportion for genes associated with one or more GO terms (0.16 versus 0.20, p = 3 × 10−5). Included among the most highly significant genes in the unpolarized MK tests (Table S12) were several with reproduction-related functions. For example, the sperm of males carrying mutations in Pkd2 (CG6504), the gene with the smallest MK p-value in the genome, are not properly stored in females, suggesting sperm–female interactions (perhaps associated with sperm competition) as a possible agent of selection [92,93]. Other examples include Nc (CG8091), which plays a role in sperm individualization [94]; Acxc (CG5983), a sperm-specific adenylate cyclase [95]; and Dhc16F (CG7092), which is a component of the axonemal dynein complex (suggesting a possible role of selection on sperm motility). For polarized MK tests, we used the D. yakuba genome to infer which fixed differences between D. simulans and D. 
melanogaster occurred along the D. simulans lineage (Materials and Methods). These fixations were then compared to D. simulans polymorphisms. This reduced, filtered dataset contained 2,676 genes of which 384 (14%) and 169 (6%) were significant at the 0.05 and 0.01 levels, respectively (deviating in the direction of adaptive evolution; Dataset S1). Twenty-three genes showed evidence of a significant (p […] 10 genes include nuclear envelope, nuclear pore, amino acid-polyamine transporter activity, ubiquitin-specific protease activity, protein deubiquitination, and protein import into the nucleus. Results from a comparable analysis of D. melanogaster protein evolution are shown in Table S21. Using the same criteria of n > 10 genes and p […] 2) indicative of directional selection on 5′ UTRs, 3′ UTRs, and intron sequences, respectively. Among the most unusual 5′ UTRs are those associated with genes coding for proteins associated with the cytoskeleton or the chromosome, categories that also appeared as unusual in the MK tests on protein variation. Two of the top-ten 3′ UTRs are associated with the SAGA complex, a multi-subunit transcription factor involved in recruitment of RNA Pol II to the chromosome [111]. Among the extreme introns, two are from genes coding for components of the ABC transporter complex and two are from genes coding for centrosomal proteins, again pointing to the unusual evolution of genes associated with the cytoskeleton and chromosome structure and movement. As previously noted, a large number of significant UTRs deviate in the direction of excess polymorphism (relative to synonymous mutations). Given the potential importance of the UTRs in regulating transcript abundance and localization, translational control, and as targets of regulatory microRNAs [112], such UTRs could be attractive candidates for functional investigation. 
Contingency tests of significant versus nonsignificant MK test for amino acids versus each of the noncoding elements yielded p-values of 0.65, 0.04, and 0.07 for 5′ UTRs, 3′ UTRs, and introns, respectively. Thus, there is weak evidence that genes under directional selection on amino acid sequences tend to have 3′ UTRs and introns influenced by directional selection as well. Whole-Genome Analysis of Polymorphic and Fixed Variants Up to this point, our analyses have investigated various attributes of polymorphism and divergence based on windows or genes. An alternative approach for understanding the causes of variation and divergence is to analyze polymorphism and divergence across site types. Table 2 shows whole-genome counts of polymorphic and polarized fixed variants for UTRs, synonymous sites, nonsynonymous sites, introns, and intergenic sites. We also provide data for polarized, synonymous preferred or unpreferred variants. Almost all preferred versus unpreferred codons in D. melanogaster end in GC versus AT, respectively [113]; thus, preferred versus unpreferred codons can be thought of as GC-ending versus AT-ending codons. Table 2 Whole-Genome Counts of Polarized Polymorphic and Fixed Variants Nonsynonymous sites showed the smallest ratio of polymorphic-to-fixed variants, which is consistent with the MK tests and supports the idea that such sites are the most likely to be under directional selection. Nonsynonymous polymorphisms also occur at slightly lower frequency than do noncoding variants (Table S25). Synonymous sites have the highest ratio of polymorphic-to-fixed variants, which supports the previously documented elevated ratio of polymorphic-to-fixed unpreferred synonymous variants in D. simulans [89]. The confidence intervals of the ratio of polymorphic-to-fixed variants among site types are nonoverlapping with the exception of intron and intergenic sites. 
If preferred synonymous mutations are, on average, beneficial [89,114], then the smaller polymorphic-to-fixed ratio for nonsynonymous and UTR variants versus preferred variants implies that a large proportion of new nonsynonymous and UTR mutations are beneficial. Using similar reasoning, the data in Table 2 suggest that directional selection plays a larger role in nonsynonymous and UTR divergence compared to intron and intergenic divergence [20,115,116]. These conclusions are consistent with estimates of α [11,117], the proportion of sites fixing under directional selection (assuming that synonymous sites are neutral and at equilibrium) for different site types. Base Composition Evolution Determining the relative contributions of various mutational and population genetic processes to base composition variation and inferring the biological basis of selection on base composition remain difficult problems. Much of the previously published data on base composition variation in D. simulans have been from synonymous sites [55,89,90,118]. Several lines of evidence [55,89,90,113,118] suggest that on average, preferred codons have higher fitness than unpreferred codons, with variation in codon usage being maintained by AT-biased mutation, weak selection against unpreferred codons, and genetic drift [23,114]. However, the possibilities of nonequilibrium mutational processes and/or natural selection favoring different base composition in different lineages have also been addressed [119]. The D. simulans population genomic data allow for a thorough investigation of the population genetics and evolution of base composition for both coding and noncoding DNA [59,120]. The analyses discussed below use parsimony to polarize polymorphic and fixed variants. Complete genomic and gene-based data are available as Datasets S9 and S10. Synonymous sites. Previous reports suggested that D. 
simulans synonymous sites are evolving towards higher AT content, although the excess of AT over GC fixations is small [55]. That trend was confirmed in this larger dataset; there are many more ancestral preferred codons that have fixed an unpreferred codon (coverage classes four–six, n = 21,156) in D. simulans compared with ancestral unpreferred codons that have fixed a preferred codon (coverage classes four–six, n = 15,409). Furthermore, the population genomic data also support previous reports [89] that D. melanogaster synonymous sites are becoming AT-rich at a faster rate than D. simulans synonymous sites (Table S26), contributing to the higher median dS in D. melanogaster (0.069) compared to D. simulans (0.051, Wilcoxon Signed Rank, p […]

[…] 200 bp. Expected heterozygosity was also estimated for genomic features (exons, introns, UTRs, and intergenic sequence) that had a minimum size and coverage [i.e., n(n – 1) × s ≥ 100, where n = average number of alleles sampled and s = number of sites]. For coding regions, the numbers of silent and replacement sites were counted using the method of Nei and Gojobori [129]. The pathway between two codons was calculated as the average number of silent and replacement changes from all possible paths between the pair. The variance of pairwise differences in sliding windows (150-kb windows, 10-kb increments) was used as a method of summarizing the magnitude of linkage disequilibrium across the D. simulans genome. For each window, we calculated coverage weighted variance of the expected heterozygosity (see above) for all pairs of alleles.

Divergence. Unpolarized (i.e., pairwise) divergence between D. simulans and D. melanogaster was estimated for 10-kb windows, 50-kb windows, 30-kb sliding windows (10-kb increments), 150-kb sliding windows (10-kb increments), 210-kb windows (10-kb increments), and genomic features that had a minimum number of nucleotides represented [i.e., n × s > 100, with n and s as above in calculations of π]. 
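The silent and replacement site counts referred to above for coding regions can be illustrated with a short sketch in the spirit of the Nei and Gojobori method. It covers only the site-counting step, not the averaging over pathways between codon pairs. Everything named here (`CODON_TABLE`, `synonymous_sites`, `count_sites`) is hypothetical, the standard genetic code is assumed, input is assumed to be upper-case DNA, and mutations to stop codons are treated as nonsynonymous, which is one common convention rather than something the text specifies.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order (first base varies slowest).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def synonymous_sites(codon):
    """Number of synonymous sites (between 0 and 3) in one sense codon."""
    aa = CODON_TABLE[codon]
    if aa == "*":
        raise ValueError("stop codons are not counted")
    total = 0.0
    for pos in range(3):
        mutants = [codon[:pos] + b + codon[pos + 1:]
                   for b in BASES if b != codon[pos]]
        # Fraction of the three possible changes at this position that keep
        # the amino acid; changes to a stop codon count as nonsynonymous.
        total += sum(CODON_TABLE[m] == aa for m in mutants) / 3
    return total


def count_sites(cds):
    """Return (synonymous, nonsynonymous) site counts for an in-frame CDS."""
    usable = len(cds) - len(cds) % 3  # ignore any trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    s = sum(synonymous_sites(c) for c in codons)
    return s, 3 * len(codons) - s


if __name__ == "__main__":
    print(count_sites("ATGTTTGGG"))
```

Per-site rates then follow by dividing the observed numbers of silent and replacement differences by these two totals.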
Unpolarized divergence was calculated as the average pairwise divergence at each site, which was then summed over sites and divided by the total number of sites. A Jukes-Cantor [130] correction was applied to account for multiple hits. For coding regions, the numbers of silent and replacement sites were counted using the method of Nei and Gojobori [129]. The pathway between two codons was calculated as the average number of silent and replacement changes from all possible paths between the pair. Estimates of unpolarized divergence over chromosome arms were calculated for each feature with averages weighted by the number of sites per feature. Lineage-specific divergence was estimated by maximum likelihood using PAML v3.14 [131] and was reported as a weighted average over each line with greater than 50 aligned sites in the segment being analyzed. Maximum likelihood estimates of divergence were calculated over 10-kb windows, 50-kb windows, 30-kb sliding windows (10-kb increments), 150-kb sliding windows (10-kb increments), 210-kb windows (10-kb increments), and gene features (exons, introns, and UTRs). PAML was run in batch mode using a BioPerl wrapper [132]. For noncoding regions and windows, we used baseml with HKY as the model of evolution to account for transition/transversion bias and unequal base frequencies [133]; for coding regions, we used codeml with codon frequencies estimated from the data. Insertion and deletion divergence was calculated as div_i, the coverage-weighted average divergence of deletions (i = ▵) or insertions (i = ▿) per base pair […], where n_c is the number of aligned base pairs in the genomic region (e.g., gene feature or window) with sequencing coverage c, and k_cj is the number of sites in this region with coverage c at which the derived state with respect to the D. melanogaster reference sequence (▵ or ▿) occurs in j out of the c sequences.

MK tests (unpolarized and polarized). Unpolarized MK tests [4] used D. 
simulans polymorphism data and the D. melanogaster reference sequence for counting fixed differences. Polarized MK tests used D. yakuba to infer the D. simulans/D. melanogaster ancestral state. For both polarized and unpolarized analyses, we took the conservative approach of retaining for analysis only codons for which there were no more than two alternative states. For cases in which two alternative codons differed at more than one position, we used the pathway between codons that minimized the number of nonsynonymous substitutions. This is conservative with respect to the alternative hypothesis of adaptive evolution. Polymorphic codons at which one of the D. simulans codons was not identical to the D. melanogaster codon were not included. To reduce multiple testing problems, we filtered the data to retain for further analysis only genes that exceeded a minimum number of observations; we required that each row and column in the 2 × 2 table (two variant types and polymorphic versus fixed) sum to six or greater. Statistical significance of 2 × 2 contingency tables was determined by Fisher's Exact test. MK tests were also carried out for introns and Gold Collection UTRs by comparing synonymous variants in the associated genes with variants in these functional elements. For intergenic MK tests, we used synonymous variants from genes within 5 kb of the 5′ and/or 3′boundary of the intergenic segment. For some analyses, we restricted our attention to MK tests that rejected the null in the direction of adaptive evolution. This categorization was determined following Rand and Kann [134]. Polarized 2 × 2 contingency tables were used to calculate α, which under some circumstances can be thought of as an estimate of the proportion of variants fixing under selection [11]. Bootstrap confidence intervals of α and of the ratio of polymorphic-to-fixed variants for each functional element (Table 2) were estimated in R using bias correction and acceleration [135]. Rate variation. 
Our approach takes overall rate variation among lineages into account when generating expected numbers of substitutions under the null model and allows for different rates of evolution among chromosome arms (e.g., a faster-X effect). For example, the number of substitutions for all X-linked 50-kb windows was estimated using PAML (baseml), allowing different rates for each lineage. All D. simulans lines were used, with the estimated substitution D. simulans rate for each window being the coverage-weighted average. This generated an empirically determined branch length of the X chromosome for the average over each of the D. simulans lines (from all three way comparisons with D. melanogaster and D. yakuba) weighted by the number of bases covered. We carried out a relative rate test for windows or features in D. simulans and D. melanogaster by generating the expected number of substitutions for each window/feature/lineage based on the branch length of the entire chromosome in each lineage (PAML) and the coverage of the window/feature in question in each lineage. We then calculated the deviation from the expected number of substitutions as (observed – expected substitutions)2/expected substitutions for any window/feature/lineage. GO by MK permutations. For each GO term associated with at least five MK tests, we calculated the proportion of significant (p Genes Located in Genomics Regions Showing Disproportionate Reductions of Nucleotide Heterozygosity in the US Sample (68 KB DOC) Click here for additional data file. Table S11 GO Terms Overrepresented in Windows from Out-of-Africa/Madagascar Analysis. MF and BP, molecular function and biological process, respectively (50 KB DOC) Click here for additional data file. Table S12 GO Terms Associated with the Top 20 Genes with the Smallest Unpolarized MK Test p-Value (118 KB DOC) Click here for additional data file. Table S13 Genes Showing Excess Protein Polymorphism (p 2) (55 KB DOC) Click here for additional data file. 
Table S23. Genes Associated with the Most-Significant 3′ UTR Polarized MK Tests (Average Coverage per Site > 2) (52 KB DOC)
Table S24. Genes Associated with the Most-Significant Intron MK Tests (Average Coverage per Site > 2) (64 KB DOC)
Table S25. Number (Frequency) of Nonsynonymous and Noncoding Polymorphisms (Sites with Coverage of n = 5 or n = 6 D. simulans Alleles) for Different Frequency Classes (40 KB DOC)
Table S26. Counts and Substitution Rates per Site of Preferred and Unpreferred Variants “Fixed” along the D. simulans and D. melanogaster Lineages (Inferred by Parsimony). Substitution rates were determined by dividing the number of preferred/unpreferred fixations by the number of unpreferred/preferred ancestral bases. (74 KB DOC)
Table S27. X and A, Polymorphic and Fixed, Preferred and Unpreferred Variants for Sites with Coverages Four, Five, or Six (33 KB DOC)
Table S28. Unpreferred Polymorphisms (Coverage Five Sites) Occur at Lower Frequency than Preferred Polymorphisms (30 KB DOC)
Table S29. Genes with Significant Polarized MK Tests Have a Higher Proportion of Preferred Fixations than Genes with Nonsignificant MK Tests (27 KB DOC)
Table S30. Preferred, Unpreferred, and Noncoding GC/AT Fixed Variants across the Genome (Coverage Classes Three–Six) (27 KB DOC)
Table S31. Polymorphic GC Variants Occur at Higher Frequency than Polymorphic AT Variants. X-linked polymorphic GC variants occur at higher frequency than autosomal polymorphic GC variants (coverage-six polymorphisms from intergenic and intron DNA). (32 KB DOC)
Table S32. D. yakuba Genome Input and Assembly Statistics. Statistics presented are for the whole-genome assembly before it was anchored using alignments to D. melanogaster. “Contigs” are contiguous sequences not interrupted by gaps, and “supercontigs” are ordered and oriented “contigs” including estimated gap sizes. The N50 statistic is defined as the largest length L such that 50% of all nucleotides are contained in contigs of size at least L. The total contig size was 167 Mb, with 97% of the consensus base pairs having quality scores of at least 40 (Q40) (expected error rate of less than or equal to 10⁻⁴) and 98% are at least Q20. (59 KB DOC)
Table S33. Read and Trim Statistics for D. simulans Syntenic Assemblies (35 KB DOC)
Table S34. Correlation (Kendall's τ) between Copy Numbers of TE Families in “Trimmed” Euchromatic Regions of D. simulans and D. melanogaster. The simulans TEs are the “clustered” TEs. The melanogaster TEs are those annotated in release 4.0. (31 KB DOC)
Table S35. Tests of the Homogeneity of the Proportions of Each Family across Six D. simulans Lines, Homogeneity of Classes across Lines, and Homogeneity of Families within Classes across Lines (33 KB DOC)
Table S36. Test of the Homogeneity of Relative Family Copy Numbers across the Five Chromosome Arms (Pooled across Lines) for All TEs and within the Four Classes (33 KB DOC)
Table S37. Test of the Homogeneity of Relative Family Copy Numbers on the X chromosome versus the Autosomes (Pooled across Lines) for All TEs and within the Four Classes (32 KB DOC)
Table S38. Heterogeneity of “Cloned” TE Numbers in Various Gene Annotation Elements (29 KB DOC)
Table S39. Comparison of Expected D. simulans Nucleotide Heterozygosity and Divergence for 30-kb Windows Centered on the Estimated Position of “Clustered” TEs (+) Compared to Windows without Clustered TEs (–). The difference between the distributions (TEs: +/-) was tested with the Mann-Whitney U test; the p-value is in the upper position in the last column (probability < / ratio). The ratio of the means is also shown (lower in last column). (50 KB DOC)
Text S1. Transposable Elements (48 KB DOC)
Accession Numbers
The GenBank (http://www.ncbi.nlm.nih.gov/Genbank/) accession number for D. yakuba is AAEU01000000 (version 1) and for the D. simulans w501 whole-genome shotgun assembly is TBS-AAEU01000000 (version 1).

              Genomic Analyses of the Microsporidian Nosema ceranae, an Emergent Pathogen of Honey Bees

Introduction
Honey bees, Apis mellifera, face diverse parasite and pathogen challenges against which they direct both individual and societal defenses [1]. Severe honey bee colony losses have occurred in the past several years in the United States, Asia, and Europe. Some of these losses have been attributed to Colony Collapse Disorder (CCD), a sporadic event defined by high local colony mortality, the rapid depopulation of colonies, and the lack of known disease symptoms [2]. While causes of CCD are not yet known, and are likely to be multifactorial, increased pathogen loads in declining bees suggest a role for disease. One candidate disease agent is the microsporidian Nosema ceranae, a species that has sharply increased its range in recent years [3]. Microsporidia are a highly derived lineage of fungi that parasitize a diverse assemblage of animals [4]. N. ceranae was first described from colonies of the Asian honey bee, Apis cerana, that were sympatric with A. mellifera colonies in China. Fries et al. [5] suggested that a host switch from A. cerana to A. mellifera occurred relatively recently. Currently, N. ceranae is the predominant microsporidian parasite of bees in North America [6] and Europe [3].

N. ceranae is an obligate intracellular parasite of adult honey bees. Ingested spores invade the gut epithelium immediately after germination. Intracellular meronts eventually lead to mesospores that can invade neighboring cells after host-cell lysis. Ultimately, hardier exospores are passed into the gut and excreted, at which point these exospores are infective to additional hosts. While congener N. apis appears to restrict its life cycle to the gut wall, N. ceranae was recently shown to invade other tissues [7]. Health impacts of Nosema infection on honey bees include a decreased ability to acquire nutrients from the environment and ultimately a shortened lifespan [8]. 
At the colony level, Nosema infection can lead to poor colony growth and poor winter survivorship. Nevertheless, N. ceranae is widespread in both healthy and declining honey bee colonies and its overall contribution to honey bee losses is debatable [9],[10],[11]. Genetic studies of N. ceranae and its infected host have been hindered by a lack of genetic data. Prior to this study, microsporidian sequence data were most extensive for the mammalian pathogen Encephalitozoon cuniculi, complemented by genome or EST surveys of several other human and insect pathogens [4],[12],[13],[14] and a recent draft annotation of the mammalian pathogen Enterocytozoon bieneusi [15]. Public sequences for N. ceranae were limited to ribosomal RNA loci. We therefore chose pyrosequencing to rapidly and cost-effectively characterize the N. ceranae genome, simultaneously illuminating the ecology and evolution of this parasite while enabling focused studies of virulence mechanisms and population dynamics. A genomic approach also leverages existing microsporidian and fungal genome sequence, advancing through comparative analysis our understanding of how microsporidian genome architecture and regulation has evolved. Microsporidia are remarkable in having small genomes that overlap prokaryotes in size, a propensity for overlapping genes and transcripts, few introns, and predicted gene complements less than half that found for yeast [16],[17],[18]. Microsporidian cells are also simplified at the organellar level and lack mitochondria, instead containing a genome-less organelle, the mitosome, that appears incapable of oxidative phosphorylation but may function in iron-sulfur biochemistry [19],[20]. Biochemical studies and sequence analyses have identified novel features of carbon metabolism and a dependency on host ATP, but much of their metabolism remains unclear [17],[19] and major metabolic pathways can differ substantially among species [15]. Here we analyze a draft genome assembly for N. 
ceranae, present a gene set of 2,614 putative proteins that can now be used to uncover salient aspects of Nosema pathology, and describe gene families and ontological groups that are distinct relative to other sequenced fungi. We provide formatted annotations for viewing with the GBrowse genome viewer [21], which we hope will aid future studies of this economically important pathogen and of microsporidia in general.

Materials and Methods

Nosema ceranae spore purification
Honey bees infected with N. ceranae were collected from the USDA-ARS Bee Research Laboratory apiaries, Beltsville, MD. Alimentary tracts of these bees were removed and crushed in sterile water and filtered through a Corning (Lowell, MA) Netwell insert (24 mm diameter, 74 µm mesh size) to remove tissue debris. The filtered suspension was centrifuged at 3,000×g for 5 minutes and the supernatant discarded. The re-suspended pellet was further purified on a discontinuous Percoll (Sigma-Aldrich, St. Louis, MO) gradient consisting of 5 ml each of 25%, 50%, 75% and 100% Percoll solution. The spore suspension was overlaid onto the gradient and centrifuged at 8,000×g for 10 minutes at 4°C. The supernatant was discarded and the spore pellet was washed by centrifugation and suspension in distilled sterile water.

Genomic DNA extraction
Approximately 10⁶ N. ceranae spores were suspended in 500 µl CTAB buffer (100 mM Tris-HCl, pH 8.0; 20 mM EDTA, pH 8.0; 1.4 M sodium chloride; 2% cetyltrimethylammonium bromide, w/v; 0.2% 2-mercaptoethanol) and broken by adding 500 µg of glass beads (425–600 µm, Sigma-Aldrich, St. Louis, MO) into the tube and disrupting the mixture at maximum speed for 2–3 minutes using a FastPrep Cell Disrupter (Qbiogene, Carlsbad, CA). The mixture was then incubated with proteinase K (200 µg/ml) for five hours at 55°C. Genomic DNA was extracted in an equal volume of phenol/chloroform/isoamyl alcohol (25:24:1) twice, followed by a single extraction in chloroform. 
The purified DNA was precipitated with isopropanol, washed in 70% ethanol, and dissolved in 50 µl sterile water. The concentration and purity of the DNA were determined by spectrophotometric absorption at 260 nm, and ratios of absorption at 260 nm and 280 nm.

Sequencing and assembly
Extracted DNA was pooled, sheared, and processed using in-house protocols at 454 Life Sciences (Branford, CT). The template was then amplified by two separate runs of 32 emulsion-PCR reactions each, with each reaction composed of templates containing 454-linker sequence attached to 600,000 sepharose beads [22]. Successful amplifications were sequenced using GS FLX picotiter plates and reads were trimmed of low-quality sequence before assembly with the Celera Assembler package CABOG [23].

Gene predictions
Gene predictions were merged from three distinct sources. We first used the Glimmer package [24], which is designed for predicting exons of prokaryote and small eukaryote genomes, using a hidden Markov model to evaluate the protein-coding potential of ORFs. The model was initially trained on ORFs identified by Glimmer's longorf program, and then run with the following parameters: a minimum length of 90 codons, a maximum overlap of 50 bp, and a threshold score of 30. ORFs that contained a high proportion of tandem sequence repeats were ignored. Second, we identified all additional ORFs not predicted to be protein-coding by Glimmer that were BLASTX [25] matches to GenBank fungal proteins (at a lax expectation threshold of 1.0E-5). Finally, all remaining ORFs were searched with the HMMER program (http://hmmer.janelia.org) for Pfam-annotated protein domains [26] using an expectation threshold of 1.0E-1. In 58 cases, adjacent ORFs matching different parts of the same GenBank protein or Pfam domain could be joined by hypothesizing a single-base frameshift error in the assembly. 
Our annotations span the start and stop codons of these conjoined ORFs and indicate the approximate site of the frameshift with the ambiguity characters N and X, respectively, in the nucleotide and protein sequence. tRNA genes were predicted with the program ARAGORN [27]. Ribosomal genes were identified by BLASTN searches and alignments with existing Nosema ribosomal sequence in GenBank and the SILVA ribosomal database [28]. Nucleotide composition of protein-coding genes was investigated with the program INCA2.0 [29].

Protein homology searches and functional annotation
We identified probable one-to-one orthologs among these three genomes using reciprocal best BLASTP matches, with the additional requirements that the best match have an expectation ≤1.0E-10 and 10³-fold lower than that of the second-best match (identical protein predictions in E. cuniculi were considered equivalent). Best-fit homologs in yeast, as determined by BLASTP with a minimum expectation of 1.0E-10, were used to annotate N. ceranae genes with GO Slim ontologies [30]. Signal peptides were predicted using the SignalP 3.0 program [31] and transmembrane domains were predicted with TMHMM 2.0 [32]. Assignments to conserved positions in metabolic and regulatory pathways were based on the KEGG annotation resource [33], assisted by the Blast2Go program [34]. Repetitive elements were identified by searching against Repbase [35], by the pattern searching algorithm REPuter [36], and by intragenomic BLASTN analyses.

Results

Sequencing and assembly
Sequence information and annotations are posted in GenBank (www.ncbi.nlm.nih.gov) under Genome Project ID 32973. High-quality reads from two 454 GS FLX sequencing runs contributed 275.8 MB for assembly. The assembly was complicated by an extreme AT bias, frequent homopolymer runs (which are prone to sequencing error), and numerous repetitive elements (see below). 
Sixty-one independent assemblies were evaluated by systematically increasing the error parameter from zero to 6% in 0.1% increments. The final assembly used an error rate of 3.5% because this maximized both the N50 of contig size and the length of the longest contig. To search for potential mis-assembly, we compared this version to other assemblies using MUMmer [37]. We identified two contigs that likely contained collapsed repeats and replaced these with alternative versions assembled with a stricter error parameter. Other parameters of the assembly remained at their default CABOG settings. Sequencing and assembly statistics are summarized in Table 1.

Table 1. Statistics of draft N. ceranae genome assembly analyzed in this paper (10.1371/journal.ppat.1000466.t001).
Stage        Category                                   Value*
Sequencing   High-quality reads                         1,063,650
             High-quality bases                         275,848,411
             Average high-quality read length           259.3
             Average high-quality read quality score    30.4
Assembly     Number of retained contigs                 5,465
             Range of contig length (bp)                500–65,607
             Sum of contigs                             7,860,219
             Contig N50 length                          2,902
             Contig N50 number                          470
             Average contig coverage                    24.2
* Sequence lengths in base pairs.

Accidental incorporation of non-target DNA sequence into genome assemblies is a ubiquitous hazard even with stringent sample preparation. We therefore used BLAST, depth of coverage, and G+C content as criteria to help identify potential contamination, but found no evidence of sequence derived from the host genome (A. mellifera), the sympatric congener N. apis, or another common fungal pathogen of bees, Ascosphaera apis. However, we did find evidence for low-level contamination by an unknown ascomycete fungus, indicated by generally short, low-coverage, high-GC contigs with consistently stronger BLASTX matches to Ascomycota than to Microsporidia. We therefore removed all contigs with less than five-fold coverage and a G+C content of 0.5 or greater (see Fig. 
S1 ), as well as any contig that matched ascomycete ribosomal or mitochondrial sequence. After purging these suspect contigs and removing all contigs less than 500 bp in length, there remained 5465 contigs that totaled 7.86 MB of DNA. The N50 contig size of the pruned assembly was 2.9 kb (i.e., half of the total assembly, or 3.93 MB, was in contigs greater than 2.9 kb). The mean sequence coverage of contigs was 24.2×. Using the GigaBayes suite of programs [38],[39], we estimated the frequency of simple polymorphisms (indel or nucleotide, P≥0.90 per site) on the 100 longest contigs to be 1.0 per kilobase. Genomic G+C content of the final contig set was low compared with E. cuniculi, 26% vs. 47%, but typical of other surveyed microsporidia. Genomic contigs of Enterocytozoon bieneusi in GenBank have a G+C content of 24%, and Williams et al. [13] reported genomic G+C contents of Brachiola algerae and Edhazardia aedis to be 24% and 25%, respectively. Although several factors potentially associated with microbial base composition have been investigated, such as ambient temperature, mutation bias, and selection on genome replication rates, the causes of compositional bias remain unclear (see, for example, [40] and references cited therein). Because genome assemblies may not accurately represent true genome size, due to such factors as redundancy at contig ends or collapsed repeats, we applied the method of Carlton et al. [41] to estimate genome size from sequence coverage, excluding repeats. We first classified all 22-mers occurring in the read sequence not more than 40 times as the unique portion of the genome. Using this filter, the average coverage was 26.6× and 28.2× for regions of at least 1 kb and 10 kb in length, respectively. The total length of the N. ceranae reads is 261.0 MB after filtering reads with G+C content higher than 50%. With these values, the total genome size could be as high as 9.8 MB. 
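The two summary figures quoted above, the contig N50 and the coverage-based genome size, can be checked with a short Python sketch. This is only an illustration: the function names and the toy contig list are hypothetical, and dividing total read length by mean coverage is one plausible reading of how the quoted values (261.0 MB of reads, 26.6× and 28.2× coverage) lead to the 9.8 MB upper estimate. The cited method of Carlton et al. [41] may differ in detail.

```python
def n50(contig_lengths):
    """Return the N50: the largest length L such that contigs of
    length >= L together hold at least half of all assembled bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("contig_lengths must not be empty")


def genome_size_from_coverage(total_read_bases, mean_coverage):
    """Estimate genome size as total sequenced bases / mean depth.
    Units of the result follow the units of total_read_bases."""
    return total_read_bases / mean_coverage


# Toy contigs (hypothetical): total 30, so half is 15; 10 + 8 = 18 >= 15.
print(n50([10, 8, 6, 4, 2]))

# Values quoted in the text, in megabases and fold coverage.
print(round(genome_size_from_coverage(261.0, 26.6), 1))  # 1-kb regions
print(round(genome_size_from_coverage(261.0, 28.2), 1))  # 10-kb regions
```

With the lower coverage figure the estimate rounds to 9.8, matching the upper bound stated in the text; the higher coverage figure gives a somewhat smaller value.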
However, this G+C filter may be overly permissive; increasing the filter stringency to 35% G+C reduces the genome size estimate to 8.6 MB. An additional consideration is that, at the estimated level of coverage, we expect the entire genome to be sequenced with few singletons or small contigs. Yet 30.0 MB of read sequence assembled into contigs with 10 or fewer reads, including 5.5 MB of single-read contigs. These small contigs are likely to be from reads with relatively high sequencing error. If so, this would boost the average coverage of the assembly by 3×–3.5× and reduce the genome size to as low as 7.7 MB. Our attempts to measure the genome size empirically with pulse-field gel electrophoresis did not adequately resolve N. ceranae chromosomes. However, this technique in other Nosema species has yielded genome size estimates of 7.4–15 MB [42]. Thus, while our computational estimate is in reasonable agreement with current genome size estimates for the genus, an unknown but potentially significant portion of the genome may be unrepresented in this assembly and the absence of particular sequences should not be considered definitive.

Sequence repetition
The genome sequence of E. cuniculi revealed an unusual distribution of sequence repeats, characterized by a lack of known transposable elements, a paucity of simple repeats, and an abundance of near-perfect segmental duplications of 0.5–10 kb in length. Pulse-field gel electrophoretic studies have identified gross variation in the size of homologous chromosomes among and within isolates of E. cuniculi [43] and the microsporidian Paranosema grylli [44], indicating that large segmental duplications are potentially important sources of intraspecific variation. The origins and gene content of such duplications are therefore of particular interest. While the present assembly limits our ability to describe larger segmental duplications in N. 
ceranae, we were able to investigate sequence repetition in the genome by searching for microsatellite motifs and by using REPuter [36] to detect complex repeats. All eight dinucleotide repeats found were ‘AT’ repeats, ranging from a perfect 9-unit repeat to an imperfect (3 mismatches) 21-unit repeat. There were six AAT repeats greater than 6 units in length and four ATC repeats. We confined our search for complex repeats to those contigs greater than 1,200 bp in length, so as to identify repeats likely to be dispersed in the genome rather than confined to the most poorly assembled fragments. REPuter identified a total of 4,731 sequence pairs with at most three mismatches that ranged from 70 bp (the minimum threshold for detection) up to 312 bp in length with a median of 85 bp. Repeats were over-represented on smaller contigs, even within the analyzed set of relatively long contigs, indicating that they had affected assembly success. BLASTN analyses of the REPuter-identified repeats against the N. ceranae genome revealed a novel dispersed repeat with a conserved core domain approximately 700 bp in length ( Fig. S2 ). The boundaries of the element are not completely clear because the conserved domain often occurs as tandem copies, there are two or more subtypes of the element based on multiple sequence alignments, and, as expected, copies are most abundant on short contigs and near contig ends. Using an E-value cutoff of 1.0E-5, we identified one or more matches on 250 contigs. No conserved coding potential was evident for these elements, nor did we detect any homology with sequences in GenBank or Repbase. Surprisingly, this element contains a candidate polII promoter that is well conserved and generally scores between 0.90 and 1.00 (the maximum value) when submitted to a neural network prediction tool [45]. Whether this promoter-like motif is functional and, if so, whether it produces a coding or noncoding transcript remain to be seen. 
However, it is clear from BLAST searches that this promoter sequence is not associated with any of our predicted genes (see below), nor could we identify it in E. cuniculi or yeast.

Predicted genes and associated features
We identified 2,614 putative protein-coding genes, with reference names, coordinates, and annotation features provided in Text S1. Gene models were not required to have a start methionine to allow for gene predictions truncated at ends of contigs and (rarely) the possibility of non-canonical start codons or frameshifts in the assembly. In addition to BLAST-hit annotations, Text S1 also lists Pfam protein domains as well as signal peptide and transmembrane motifs. Texts S2 and S3, respectively, contain GFF-formatted data and a configuration file for viewing our annotations with the GBrowse viewer [21]. An example of these annotations viewed in GBrowse is shown in Fig. S3. The number of protein-coding genes we have predicted for N. ceranae lies in between the 1,996 RefSeq proteins given by GenBank for the sequenced E. cuniculi genome and the 3,804 predicted for E. bieneusi from sequence representing only two-thirds of the estimated genome content. The density of genes on the 100 largest N. ceranae contigs averaged 0.60 genes/kb (64.8% coding sequence). This is a lower proportion of coding sequence than found in E. cuniculi and Antonospora locustae (0.94 and 0.97 genes/kb, respectively [4]), but comparable to some other microsporidia [13]. However, gene density declines considerably with contig size (Fig. S4), consistent with a preponderance of repetitive elements (described above and to follow) or other noncoding sequence in these regions. We found forty-six contigs containing sequences that matched N. ceranae ribosomal sequence at an expectation of E […]

[…] 1000 bp, 500–1000 bp, and <500 bp, left to right. Note wide range of mean coverage, even for large contigs. (2.13 MB TIF)
Figure S2. Partial sequence alignment of copies of a novel dispersed repeat found on 250 contigs using conservative BLAST criteria. The conserved sequence includes a candidate polII promoter but no long ORF. (7.68 MB TIF)
Figure S3. Screenshots of annotated N. ceranae assembly viewed with the GBrowse application. (4.23 MB TIF)
Figure S4. Table illustrating the progressive decline in gene density as contig size decreases. (0.88 MB TIF)
Figure S5. The 65 N. ceranae tRNA genes predicted by ARAGORN [27], ordered by the corresponding amino-acid. (1.65 MB TIF)
Figure S6. Sense-strand matches to the yeast TATA motif, TATA[AT]A[AT], in the 200-bp region upstream of high-confidence start codons. The vertical axis shows the proportion of all matches upstream of the sampled genes (n = 280, see text) that begin at the specified distance from the start codon. There is a pronounced spike in TATA box motifs occurring in the vicinity of the −27 position relative to their frequency in random sequence of the same base composition. (1.42 MB TIF)
Figure S7. Codon usage of N. ceranae (red) and E. cuniculi (green) genes, plotted using INCA [29]. Each bar represents the proportion of all codons encoding a given amino-acid that are the specified codon. Thus, the values are one by definition for the single-codon amino-acids, tryptophan and methionine. (5.78 MB TIF)
Figure S8. Codon bias of genes of three microsporidian genomes. Only N. ceranae genes with homology to genes in E. cuniculi are plotted. Vertical axis is ENC' (Novembre JA [2002] Accounting for background nucleotide composition when measuring codon usage bias. Mol Biol Evol 19: 1390–1394), a measure of codon bias adjusted for nucleotide composition, plotted versus third-position G+C (GC3). Few N. ceranae genes have an ENC' less than 50. Those that do are not obviously related, by homology or ontology, to comparably biased genes in the other two species. Thus, strong codon bias may not be a useful predictor of gene-expression level in microsporidia as it is in a variety of other microbes. (1.22 MB TIF)
Figure S9. Frequency of each amino-acid, indicated by single-letter codes, in predicted proteins of N. ceranae and E. cuniculi. A. The frequency of each amino-acid in those genes that have one-to-one orthologs in the other microsporidian genomes and yeast. The conservation of these genes suggests that they have essential and ancient functions. B. The frequency of each amino-acid in all predicted proteins of the indicated species. (1.93 MB TIF)
Figure S10. Characteristics of ‘microsporidian-specific’ genes, orthologous pairs of genes found in N. ceranae and E. cuniculi that lack apparent homology with proteins of taxa outside of order Microsporidia. (1.12 MB TIF)
Figure S11. Amino-acid compositions of the putative polar tube proteins PTP1 and PTP2 in N. ceranae and two other microsporidians, E. cuniculi and A. locustae. Lengths of predicted proteins, in amino acid residues, are given in parentheses. (1.86 MB TIF)
Figure S12. Degree of synteny between the N. ceranae contigs and E. cuniculi chromosomes. For each of the three largest contigs, predicted N. ceranae genes are shown in order along the contig (not to scale). The relative orientation of each gene is indicated by the arrow. N. ceranae genes shaded gray have one-to-one orthologs with E. cuniculi genes, whereas circled genes have homologs in E. cuniculi but not a one-to-one ortholog. Unmarked genes have no detected homolog in E. cuniculi (see text). The position in kilobases and relative orientation of the E. cuniculi ortholog is shown directly below the N. ceranae gene in the row corresponding to its chromosomal location. Coordinates are based on the GenBank record for each chromosome. These contigs contain regions of extensive, coarse-scale synteny with E. cuniculi, within which there can be considerable change in gene order or orientation. There are also numerous breaks in synteny associated with either a switch in E. cuniculi chromosome or an intervening, non-homologous gene. (2.06 MB TIF)
Figure S13. Relative sequence conservation between N. ceranae proteins and their homologs in other reference species. N. ceranae genes with one-to-one orthologs in E. cuniculi and yeast (see text) were BLASTP searched against the combined proteomes of E. cuniculi, E. bieneusi, and yeast. The number of N. ceranae genes with high-scoring matches in all three reference species in this data set was 234. The upper panel plots the BLASTP score of each N. ceranae gene versus the best match in each species, ordered along the X-axis by descending score versus E. cuniculi. Values are represented as lines rather than points for easier visualization. The lower panel plots the BLASTP expectation (E-value) in ascending order versus E. cuniculi. E-values equal to zero were set to 1.0E-200 to allow a logarithmic scale. (2.63 MB TIF)
Figure S14. Alignment of adjacent N. ceranae genes that are supported by homology (see Results) and that overlap in sequence. Each set of three sequences represents a contig and two adjacent genes. Start and stop codons are indicated by boxes. (2.37 MB TIF)
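
The motif scan summarized in the Figure S6 caption can be sketched in a few lines of Python. This is a minimal illustration, assuming the upstream region is supplied as a sense-strand string that ends immediately before the start codon. The function name, the coordinate convention (the last upstream base is position −1), and the example sequence are hypothetical and are not taken from the paper; only the motif pattern TATA[AT]A[AT] comes from the caption.

```python
import re

# Yeast TATA motif as quoted in the Figure S6 caption. The zero-width
# lookahead lets overlapping matches be reported as well.
TATA = re.compile(r"(?=TATA[AT]A[AT])")


def tata_positions(upstream):
    """Return the start of each sense-strand motif match, relative to
    the start codon (the last base of `upstream` is position -1)."""
    seq = upstream.upper()
    return [m.start() - len(seq) for m in TATA.finditer(seq)]


# Hypothetical 40-bp upstream region with a single motif that begins
# 27 bp before the start codon.
example = "G" * 13 + "TATAAAA" + "C" * 20
print(tata_positions(example))
```

Tallying these positions over many genes, and comparing against shuffled sequence of the same base composition, would give a profile of the kind the caption describes.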

                Author and article information

                Journal
                Nat Commun
                Nature Communications
                Nature Pub. Group
                2041-1723
                21 September 2010
                : 1
                : 77
                Affiliations
                [1] Department of Botany, Canadian Institute for Advanced Research, University of British Columbia, 3529-6270 University Boulevard, Vancouver, British Columbia, Canada V6T 1Z4.
                [2] FASTERIS S.A., Ch. du Pont-du-Centenaire 109, PO Box 28, CH-1228 Plan-les-Ouates, Switzerland.
                [3] Tulane National Primate Research Center, Tulane University, 18703 Three Rivers Road, Covington, Louisiana 70433, USA.
                Author notes
                [*]

                These authors contributed equally to this work.

                [†]

                Present address: Department of Biology, Canadian Institute for Advanced Research, University of Ottawa, Gendron Hall, Ottawa, Ontario, Canada K1N 6N5.

                Article
                ncomms1082
                10.1038/ncomms1082
                4355639
                20865802
                70a3e3da-8c25-4a70-bf10-9bdf1553a7fd
                Copyright © 2010, Nature Publishing Group, a division of Macmillan Publishers Limited. All Rights Reserved.

                This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/

                History
                : 01 June 2010
                : 25 August 2010
                Categories
                Article

                Uncategorized
