95
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      Reassessment of the Listeria monocytogenes pan-genome reveals dynamic integration hotspots and mobile genetic elements as major components of the accessory genome

      research-article

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          Background

          Listeria monocytogenes is an important food-borne pathogen and model organism for host-pathogen interaction, thus representing an invaluable target considering research on the forces governing the evolution of such microbes. The diversity of this species has not been exhaustively explored yet, as previous efforts have focused on analyses of serotypes primarily implicated in human listeriosis. We conducted complete genome sequencing of 11 strains employing 454 GS FLX technology, thereby achieving full coverage of all serotypes including the first complete strains of serotypes 1/2b, 3c, 3b, 4c, 4d, and 4e. These were comparatively analyzed in conjunction with publicly available data and assessed for pathogenicity in the Galleria mellonella insect model.

          Results

          The species pan-genome of L. monocytogenes is highly stable but open, suggesting an ability to adapt to new niches by generating or including new genetic information. The majority of gene-scale differences represented by the accessory genome resulted from nine hyper variable hotspots, a similar number of different prophages, three transposons (Tn916, Tn554, IS3-like), and two mobilizable islands. Only a subset of strains showed CRISPR/Cas bacteriophage resistance systems of different subtypes, suggesting a supplementary function in maintenance of chromosomal stability. Multiple phylogenetic branches of the genus Listeria imply long common histories of strains of each lineage as revealed by a SNP-based core genome tree highlighting the impact of small mutations for the evolution of species L. monocytogenes. Frequent loss or truncation of genes described to be vital for virulence or pathogenicity was confirmed as a recurring pattern, especially for strains belonging to lineages III and II. New candidate genes implicated in virulence function were predicted based on functional domains and phylogenetic distribution. A comparative analysis of small regulatory RNA candidates supports observations of a differential distribution of trans-encoded RNA, hinting at a diverse range of adaptations and regulatory impact.

          Conclusions

          This study determined commonly occurring hyper variable hotspots and mobile elements as primary effectors of quantitative gene-scale evolution of species L. monocytogenes, while gene decay and SNPs seem to represent major factors influencing long-term evolution. The discovery of common and disparately distributed genes considering lineages, serogroups, serotypes and strains of species L. monocytogenes will assist in diagnostic, phylogenetic and functional research, supported by the comparative genomic GECO-LisDB analysis server ( http://bioinfo.mikrobio.med.uni-giessen.de/geco2lisdb).

          Related collections

          Most cited references90

          • Record: found
          • Abstract: found
          • Article: not found

          Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial "pan-genome".

          The development of efficient and inexpensive genome sequencing methods has revolutionized the study of human bacterial pathogens and improved vaccine design. Unfortunately, the sequence of a single genome does not reflect how genetic variability drives pathogenesis within a bacterial species and also limits genome-wide screens for vaccine candidates or for antimicrobial targets. We have generated the genomic sequence of six strains representing the five major disease-causing serotypes of Streptococcus agalactiae, the main cause of neonatal infection in humans. Analysis of these genomes and those available in databases showed that the S. agalactiae species can be described by a pan-genome consisting of a core genome shared by all isolates, accounting for approximately 80% of any single genome, plus a dispensable genome consisting of partially shared and strain-specific genes. Mathematical extrapolation of the data suggests that the gene reservoir available for inclusion in the S. agalactiae pan-genome is vast and that unique genes will continue to be identified even after sequencing hundreds of genomes.
            Bookmark
            • Record: found
            • Abstract: found
            • Article: not found

            Organised Genome Dynamics in the Escherichia coli Species Results in Highly Diverse Adaptive Paths

            Introduction Escherichia coli was brought into laboratories almost a century ago to become one of the most important model organisms and by far the best-studied prokaryote. Major findings in phage genetics, bacterial conjugation, recombination, genetic regulation and chromosome replication involved the use of E. coli, especially laboratory derivatives of the K-12 strain, originally isolated from the faeces of a convalescent diphtheria patient in Palo Alto in 1922 [1]. However, K-12 derivatives are far from representing the whole E. coli species [2]. The primary habitat of E. coli is the lower intestinal tract of humans and other vertebrates, with which it typically establishes commensal associations. Healthy humans typically carry more than a billion E. coli cells in their intestine. It has been estimated that half of the living E. coli cells are outside their host, in their secondary habitat [3]. Beside these habitats, certain strains have the potential to cause a wide spectrum of intestinal and extra-intestinal diseases such as urinary tract infection, septicaemia, meningitis, and pneumonia in humans and animals [4]. Furthermore, Shigella, which have been elevated to the genus order with four species (dysenteriae, flexneri, boydii, sonnei) based on their capacity to generate a specific mucosal invasive diarrhoea strictly in humans and their biochemical characteristics, in fact belong to the E. coli species [5]–[7]. Of note, Shigella and enteroinvasive E. coli are considered the only obligate pathogens of the species, whereas other strains are facultative pathogens with a broad host range. Thus, natural isolates of E. coli/Shigella live in conditions quite different from those in the laboratory and must cope with very diverse environments that provide stresses ranging from immune system attack and protozoal grazing to starvation, low temperatures, and, more recently, antibiotic therapy. With its large range of pathologies, E. coli is a major cause of human morbidity and mortality around the world. Each year E. coli causes more than two million deaths due to infant diarrhoea [8],[9] and extraintestinal infections (mainly septicaemia derived from urinary tract infection) [10], and is also responsible for approximately 150 million cases of uncomplicated cystitis [10]. Since humans and food animals carry so many E. coli cells that may establish commensal or antagonistic interactions with their hosts it is mandatory to define the genetic and population determinants that drive commensal strains to adopt a pathogenic behaviour. Population genetic studies based on both multi-locus enzyme electrophoresis [11]–[13] and various DNA markers [14]–[18] have identified four major phylogenetic groups (A, B1, D and B2) and a potential fifth group (E) among E. coli strains. Strains of these groups differ in their phenotypic characteristics, including the ability to use certain sugars, antibiotic resistance profiles and growth rate–temperature relationships [19]. The distribution (presence/absence) of a range of virulence factors thought to be involved in the ability of a strain to cause diverse diseases also varies among strains of these phylogenetic groups [20]–[22], indicating a role of the genetic background in the expression of virulence [23]. Consequently, these groups are differently associated with certain ecological niches, life-history characteristics and propensity to cause disease. For example, group B2 and D strains are less frequently isolated from the environment [24], but more frequently recovered from extra-intestinal body sites [23]. While B2 strains represent 30 to 50% of the strains isolated from the faeces of healthy humans living in industrialised countries, they account for less than 5% in French Guyana Amerindians [25]–[26]. The clear clustering of E. coli strains into monophyletically meaningful groups has long been used as an argument favouring clonality within the species. However, analysis of gene sequences shows pervasive recombination, matching the well-known efficiency of conjugation and transduction of the species [17],[27]. Hence, it remains controversial whether such frequent recombination obliterates the phylogenetic signal. E. coli genomes show evidence of widespread acquisition of functions by horizontal gene transfer, concomitant with similar amounts of gene deletion [28]–[29]. While less than 3% of nucleotide divergence is found among conserved genes, the gene content between pairs of E. coli genomes may diverge by more than 30% [30]. Such diversification of gene content due to horizontal gene transfer contributes greatly to the diversity of the strains' phenotypes and can be accurately quantified only by the sequencing of a large number of strains to completion and closure. Until now, sequencing efforts in E. coli have been focused mainly on pathogenic strains, particularly on diarrhoeal and group B2 extraintestinal pathogenic strains (see Table 1), precluding an unbiased assessment of the diversity of the species. Therefore, we have sequenced with high coverage and up to completion the genomes of 6 human-source E. coli strains. The E. coli strains were chosen to complement the available sequences and other ongoing sequencing projects (http://msc.jcvi.org/e_coli_and_shigella/index.shtml, http://www.sanger.ac.uk/Projects/Escherichia_Shigella/). They encompass two commensal strains of phylogenetic groups B1 and B2, a group B1 enteroaggregative strain, two group D urinary tract infection strains and a group B2 newborn meningitis strain (Table 1). We also sequenced the type strain of the closest E. coli relative, i.e., E. fergusonii [31], as an outgroup to permit accurate and meaningful evolutionary analyses with the 6 new E. coli genomes and the 14 other currently available E. coli/Shigella genomes. To statistically substantiate the identification of extraintestinal virulence-associated genes, we also applied a mouse lethality assay to the strains [32] to quantify the intrinsic virulence of the strain, excluding host variability and other potential confounding factors (Table 1). Our goal was to take the outstanding opportunity provided by the availability of many genomes of a single bacterial species, regarding which a considerable amount of knowledge has been accumulated over the years, to answer to the following questions. (i) Is there genome-wide evidence of frequent recombination and does it vary with genome location? (ii) If so, can one nonetheless infer an intra-specific bacterial phylogeny? (iii) How do the different factors of genome dynamics (mutation, horizontal gene transfer with or without recombination) result together in strain diversification? (iv) Is genome dynamics in conflict with genome organisation? (v) How does the commensalism/pathogenicity duality evolve? 10.1371/journal.pgen.1000344.t001 Table 1 Principal characteristics of the 20 Escherichia coli/Shigella strains and 1 E. fergusonii strain. Strains Host Sample Clinical condition (Pathotypea) Phylogenetic groupb Extraintestinal mouse model phenotypec (Number of mice killed out of 10) Genome sequence reference K-12 MG1655 Human Faeces Commensal A NK (0) [115] K- 12 W3110 Human Faeces Commensal A NK (0) Nara Institute of Science and Technology IAI1 Human Faeces Commensal B1 NK (0) This work 55989 Human Faeces Diarrhoea (EAEC) B1 K (10) This work S. boydii 4 227 (Sb 227) Human Faeces Shigellosis S1 NDd [116] S. sonnei 046 (Ss 046) Human Faeces Shigellosis SS ND [116] S. flexneri 2a 301 (Sf 301) Human Faeces Shigellosis S3 ND [117] S. flexneri 2a 2457T (Sf 2457T) Human Faeces Shigellosis S3 NK (0) [118] S. flexneri 5b 8401 (Sf 8401) Human Faeces Shigellosis S3 ND [119] S. dysenteriae 1 197 (Sd 197) Human Faeces Shigellosis SD1 ND [116] O157:H7 EDL933 Human Faeces Diarrhoea (EHEC) E NK (1) [120] O157:H7 Sakai Human Faeces Diarrhoea (EHEC) E NK (1) [121] UMN026 Human Urine Cystitis (ExPEC) D K (10) This work IAI39 Human Urine Pyeloneprhitis (ExPEC) D K (8) This work UTI89 Human Urine Cystitis (ExPEC) B2 K (10) [122] APEC O1 Chicken Lung Colisepticemia (ExPEC) B2 K (10) [123] S88 Human Cerebro-spinal fluid New born meningitis (ExPEC) B2 K (10) This work CFT073 Human Blood Pyeloneprhitis (ExPEC) B2 K (10) [30] ED1A Human Faeces Healthy subject B2 NK (0) This work 536 Human Urine Pyeloneprhitis (ExPEC) B2 K (10) [124] E. fergusonii Human Faeces Unknown Outgroup NK (1) This work The strains in bold correspond to the strains sequenced in this work. a EAEC (enteroaggregative E. coli), EHEC (enterohaemorrhagic E. coli), ExPEC (extraintestinal pathogenic E. coli). b The E. coli and Shigella phylogenetic groups are as defined in [22] and [6], respectively. c K, killer; NK, Non Killer [32]. d ND, not determined. Results/Discussion The General Features of the Seven Sequenced Genomes We fully sequenced the chromosomes and the plasmids, if any, of 6 strains of E. coli and the reference type strain of E. fergusonii. The general features of these replicons are listed in Tables 2 and 3. Genomes were sequenced at an average of 12-fold coverage and were then finished. The 6 newly sequenced E. coli chromosomes contain between 4.7 Mb and 5.2 Mb each, corresponding to between 4627 and 5129 protein coding genes, slightly above the average value within the 20 genomes that we analyzed (∼4700 genes, ranging from 4068 to 5379). The chromosome of E. fergusonii is slightly smaller with ∼4.6 Mb and ∼4500 protein coding genes. The G+C content is very similar among the 6 strains and close to the E. coli K-12 MG1655 value (∼50.8%). The G+C content of E. fergusonii is lower at 49.9%. These chromosomes have similar densities of coding genes and numbers of stable RNA genes. By contrast, the number of pseudogenes varies more widely, from 22 in E. fergusonii to 95 in strain ED1a (Table 2). The list of pseudogenes is available in Table S1. 10.1371/journal.pgen.1000344.t002 Table 2 General features of the Escherichia coli and E. fergusonii genomes sequenced in this work with E. coli K-12 MG1655 as reference (chromosome features). Chromosome features E. coli K-12 MG1655 E. coli strains E. fergusonii ATCC 55989 IAI1 ED1a S88 IAI39 UMN026 Genome Size (bp) 4 639 675 5 154 862 4 700 560 5 209 548 5 032 268 5 132 068 5 202 090 4 588 711 G+C content (%) 50.8 50.7 50.8 50.7 50.7 50.6 50.7 49.9 rRNA operons 7 (+5S) 7 (+5S) 7 (+5S) 7 (+5S) 7 (+5S) 7 (+5S) 7 (+5S) 7 (+5S) tRNA genes 86 94 86 91 91 88 88 87 Total Protein-coding genesa 4306 4969 4491 5129 4859 4906 4918 4336 Pseudogenesb (nb) 81 79 51 95 90 80 45 22 Protein coding densityc 85.7 87.4 87.6 86.2 87 86.1 87.8 84.7 Assigned functiond (%) 80 74 77 74 77 78 76.5 77 Conserved hypothetical (%) 12.5 23 21.5 23 22 20 22 20 Orphans (%) 7.5 3 1.5 3 1 2 1.5 3 IS-like genes (nb) 66 150 42 118 47 224 92 29 Phage-associated genes (nb) 231 406 201 657 507 393 429 235 a The number of protein-coding genes is given without the number of coding sequences annotated as artifactual genes (Supplementary Table 2A). b The number of pseudogenes computed for each genome corresponds to the real number of genes that are pseudogenes: one pseudogene can be made of only one CDS (in this case the gene is partial compared to the wild type form in other E. coli strains) or of several CDSs (generally two or three CDSs corresponding to the different fragments of the wild type form in other E. coli strains). These lists of pseudogenes are available in Supplementary Table 1. c The computed protein coding density takes into account the total length of protein genes excluding overlaps between genes, artifacts, and RNA genes. d Protein genes with assigned function include the total number of definitive and putative functional assignments. 10.1371/journal.pgen.1000344.t003 Table 3 General features of the Escherichia coli and E. fergusonii genomes sequenced in this work with E. coli K-12 MG1655 as reference (plasmid features). Plasmid features E. coli strains E. fergusonii ATCC 55989 ED1a S88 UMN026 Genome Size (bp) 72 482 119 594 133 853 122 301 33 809 55 150 G+C content (%) 46.1 49.2 49.3 50.5 42 48.5 Total Protein-coding genesa 100 150 144 149 49 54 Pseudogenesb (nb) 7 11 9 8 0 5 Protein coding densityc 75.6 86.2 87 79.4 87.5 88.7 Assigned functiond (%) 74 53 65 65.7 35.4 46.6 Orphans (%) 17 31.5 25.8 27.8 12.5 20.7 Hypothetical (%) 9 15.5 9.2 6.5 52.2 32.7 IS-like genes (nb) 18 14 14 15 0 4 a The number of protein-coding genes is given without the number of coding sequences annotated as artifactual genes (Supplementary Table 2A). b The number of pseudogenes computed for each genome corresponds to the real number of genes that are pseudogenes: one pseudogene can be made of only one CDS (in this case the gene is partial compared to the wild type form in other E. coli strains) or of several CDSs (generally two or three CDSs corresponding to the different fragments of the wild type form in other E. coli strains). These lists of pseudogenes are available in Supplementary Table 1. c The computed protein coding density takes into account the total length of protein genes excluding overlaps between genes, artifacts, and RNA genes. d Protein genes with assigned function include the total number of definitive and putative functional assignments. The variation in the number of pseudogenes is uncorrelated with the number of transposable elements and phage-associated genes, which vary in the range 42–224 and 201–517 respectively. While some phage-associated genes are scattered throughout the chromosomes, the majority are concentrated in well-defined prophage regions. Analyses of the prophages suggest that many may still be functional. These prophages often carry at their extremity some unrelated cargo genes that probably arose from genomes of previously infected bacteria, as found in Salmonella [33]. We sequenced a total of 6 plasmids, varying in size from 34 to 134 kbp: four strains possess one plasmid each whereas one strain has 2 plasmids (Table 3). As frequently noted, the plasmids have a lower gene density (84%, vs. 87% for chromosomes), lower G+C content (47.4%, vs. 50.7% for chromosomes) and more pseudogenes (2.7%, vs. 1.5% for chromosomes). The percentage of orphan proteins (i.e., having no detectable homolog in other organisms) is also high on plasmids (6.5 to 52.2%), while it ranges between 1–3% on the chromosomes. A manual expert annotation of the new E. coli strains was performed on genes and regions not found in E. coli K-12 MG1655 (about 10 000 genes in total; Table S2A). This allowed the re-annotation of orthologs in the previously available Escherichia and Shigella genomes (see Materials and Methods). The annotation data, together with the results of the comparative analysis were stored in a relational database called ColiScope, which is publicly available using the MaGe Web-based interface at http://www.genoscope.cns.fr/agc/mage. This re-annotation process revealed extensive variations in the number of the newly predicted genes (Table S2B). For example, between the two strains of E. coli O157:H7 we found twice as many newly predicted genes in one strain as in the other. In some genomes important genes were missing. For example, in E. coli APEC O1 several subunits of the ribosome, DNA polymerase III, and ATP synthase were missing in the original annotation (Table S3, E. coli APEC sheet). In other genomes, the re-annotation allowed us to standardise the definition and identification of pseudogenes. For example, in S. sonnei Ss 046 most of the newly annotated genes correspond to insertion sequences (ISs) and small fragments of incompletely annotated pseudogenes (Table S3, S. sonnei sheet). As a result of this effort, the present ColiScope database contains a complete and consistent set of annotations for the 7 newly sequenced genomes and the 14 available Escherichia and Shigella genomes. These data were the starting point of the work presented here. We analyzed gene order conservation within the 21 genomes (Table S4). More than half of the genomes have exactly the gene order of E. coli K-12 MG1655, which we inferred as ancestral. Thus, the organisation of the core genome is stable in most strains. Three genomes show 1 or 2 rearrangements. Seven genomes show more than 10 blocks of synteny: 6 of these genomes are from Shigella, the high rearrangement rates of which resulted in up to 65 blocks of synteny in S. dysenteriae. These genomes have a large number of ISs, ranging from 549 to 1155 in S. flexneri and S. dysenteriae, respectively, which are well known to shuffle genomes. E. fergusonii also shows a large number of rearrangements relative to the ancestral organization of the E. coli genome. Since the organisation of some strains of the more distantly related Salmonella enterica closely resembles that of E. coli K-12 MG1655, many rearrangements must have taken place in the branch leading to E. fergusonii. Figure S1 provides the classical concentric circle representation for the 7 genomes we sequenced, showing GC skews, G+C variation, and a description of the presence of genes in ever-increasing clades within the genus, relative to the inferred ancestral genome. The first position of the sequences was chosen to match the orthologous region in the E. coli K-12 MG1655 genome and corresponds to the intergenic region between lasT and thrL. Origins and termini of replication were identified by GC skews and homology with the respective E. coli K-12 MG1655 regions. These figures show that divergence from the average G+C content often occurs in genomic regions absent in the other strains. They also reveal the highly mosaic structure of these genomes, comprising the core genes and the accessory genes, which we then set out to quantify. The Core and Pan-Genomes of E. coli The analysis of the first E. coli genomes changed our views about the evolution of gene repertoires in bacteria. Genomes within the species vary in size by more than 1 Mb, i.e., by more than 1000 genes, and even the gene repertoires of similarly sized genomes differ widely [30],[34]. We have thus taken advantage of the unprecedented availability of 20 completely sequenced genomes of the same species to analyse the evolution of the gene repertoire. We first identified the core and pan-genomes of E. coli, i.e., the genes present in all genomes and the full set of non-orthologous genes among all genomes. In our data set, the average E. coli genome contains 4721 genes, the core genome contains 1976 genes, and the pan-genome contains 17 838 genes. The random sampling of one gene within a randomly selected E. coli genome has a probability of only ∼42% of revealing a ubiquitous gene. On the other hand, the full sequencing of an E. coli strain allows observation of only one-fourth of the observed pan-genome. This implies that although some fundamental functions can be well studied by using a model strain, no single strain can be regarded as highly representative of the species. Further sampling of E. coli genomes is unlikely to change significantly the estimate of the core genome, however, the pan-genome is far from being fully uncovered (Figure 1). Annotation and sequencing artefacts may affect the estimations of core and pan-genome sizes, e.g. by spurious annotation of small genes or pseudogenes. We hope to have minimised such problems by using a coherent set of annotations. Still, we found that 40 genes deemed essential in E. coli K-12 W3110 [35] were missing in the core genome. Among these, 17 correspond to genes with conflicting reports of essentiality, or contextually essential genes such as prophage repressors, and are absent in most genomes. The other 23 genes have orthologs in most genomes and 19 are missing in a single genome where they can be found as pseudogenes interrupted by a single-nucleotide frameshift. While “pseudogenisation” does often start with such frameshifts [36], these genes correspond to core housekeeping functions, so the reported frameshifts probably represent sequencing errors. For example, it is hard to see how S. boydii could replicate without the catalytic α-subunit of the DNA polymerase III or how E. coli 536 could survive without a tyrosine tRNA synthetase. We found some comfort in verifying that none of the 23 genes was absent from the 7 genomes we sequenced. If one assumes that these essential genes cannot be deleted and that no special care has been taken to check for sequencing errors at these loci, then our estimation of the core genome should be increased by a factor of 260/(260-23) to 2167 genes. This still makes the core genome less than half of the average E. coli genome (∼46%). Importantly, no gene of the core genome, nor any operon ubiquitous in E. coli, was unique to the species, i.e., we could always find a homolog in at least one of the other fully sequenced bacterial genomes. 10.1371/journal.pgen.1000344.g001 Figure 1 Escherichia coli core and pan-genome evolution according to the number of sequenced genomes. Number of genes in common (left) and total number of non-orthologous genes (right) for a given number of genomes analysed for the different strains of E. coli. The upper and lower edges of the boxes indicate the first quartile (25th percentile of the data) and third quartile (75th percentile), respectively, of 1000 random different input orders of the genomes. The central horizontal line indicates the sample median (50th percentile). The central vertical lines extend from each box as far as the data extend, to a distance of at most 1.5 interquartile ranges (i.e., the distance between the first and third quartile values). At 20 sequenced genomes, the core-genome had 1976 genes (11% of the pan-genome), whereas the pan-genome had (i) 17 838 total genes (black), (ii) 11 432 genes (red) with no strong relation of homology ( 0.05), suggesting that our inference is unbiased with respect to successive events taking place at the same locations in long branches or by selection-purging older events in internal branches. Variation in gene repertoires has been described as being scattered on the chromosome of E. coli and balanced between the two replichores [61]. For the numerous small insertions and deletions this distribution results naturally from random insertion/deletion of genetic material. Such small indels are expected to have little impact on the large-scale organisation of the genome. What about the very large insertions/deletions? The 554 such events that involve more than 10 genes over-represent insertions over deletions (Fisher exact test, p 0.05). This shows that insertion/deletion hotspots are also rearrangement hotspots, even though we initially removed rearranged positions to identify the insertion/deletion hotspots (thus being conservative). It also suggests that rearrangements occur in these regions because they are permissive to change not because they are intrinsically recombinogenic, since the frequency with which they rearrange simply reflects their larger size. However, even if hotspots are not intrinsically recombinogenic they can still be caused by the brokering effect of homologous recombination. Indeed, incoming DNA once integrated in one genome can propagate within the population by lateral transfer via classical homologous recombination involving the homologous flanking regions. Given the observed rates of recombination in the species, this mechanism could quickly lead to the horizontal spread of highly adaptive newly acquired genes. We describe some evidence for this in the next section. Hotspots of Phylogenetic Incongruence For any given sequence alignment, the likelihood of the overall gene tree topology, i.e., the phylogenetic congruence, reflects the extent to which the phylogenetic signal of the sequences was altered by recombination. While the concatenate of genes provides a strong phylogenetic signal, the individual genes' histories can be very diverse as a result of recombination. Furthermore, these histories may depend on the genes' positioning in the chromosome. Notably, if homologous recombination helps in disseminating recent acquisitions, as we propose, the core genome around these hotspots should show signs of recombination as indicated by phylogenetic incongruence. We therefore made an analysis in 5 kbp sliding windows along the multiple genome alignment to identify the most phylogenetically incongruent regions (see Material and Methods). This method identified two large regions of very strong incongruence, one centred around rfb (Figure S8), the operon involved in O antigen synthesis, and the other around the leuX tRNA gene, and including fimA, which is under diversifying selection and is involved in the adhesion of bacteria to host cells [66]. Both loci were previously identified as hotspots of phylogenetic incongruence [67],[68]; the present analysis reveals how much they affect the chromosome. Recombination at the rfb locus significantly affects congruence within a striking 150 kbp surrounding region, i.e., from positions 1988 kbp to 2138 kbp (100% of windows tested had scores lower than 1.96 standard deviation away from the average, with an average of −4.84 and peaks at −10.19). The fim locus includes an incongruence region close to 200 kbp in length (from positions 4421 kbp to 4618 kbp, average −2.54 standard deviation and 73% with lower than −1.96 standard deviation and peaks at −6.65). Interestingly those two regions are centered on integration hotspots and encompass 11 of the 133 hotspots of integration. The genes present in such loci arose most likely by lateral transfer since they are highly dissimilar between strains. For example, genes at the rfb locus genes can exhibit less than 50% similarity, while the leuX locus encompasses a highly variable assortment of non-homologous inserts in all the genomes sequenced. Hence at least for those two major loci we find a striking link between hotspots of integration and hotspots of homologous recombination. In the case of the rfb locus, it is worth noting that the incongruence signal we observe might be a composite signal, due not only to rfb but also to neighbouring loci. Within the above defined rfb region of incongruence, a flagella locus (fli operon) associated with two hotspots of integration is also under diversifying selection. Moreover, the high pathogenicity island (HPI) is integrated within that high recombination region in many isolates and corresponds also to a hotspot of integration. It has been suggested that after a recent and unique integration event, the HPI has propagated within the species by homologous recombination [69]. The propagation or diversification of these loci, located to the left of rfb, through homologous recombination might generate the asymmetrical pattern of phylogenetic incongruence we observe around the rfb locus (extended incongruence on the left side of the rfb locus) (Figure S8). We found 23 other regions with weaker signatures of incongruence (i.e., with a 5 kbp sequence incongruence score more than 2 standard deviations from the average), each spanning less than 20 kbp. It is important to note that most of these incongruent regions include genes involved in diversification of genetic information and often pathogenicity. The vast majority of these include 3 groups of common genes. First are regions with the porin-encoding genes ompA and ompC, the flagella-encoding genes, the rfa locus coding for the core lipopolysaccaride and genes coding for several membrane proteins such as LolCDE, CcmABCDE, ABC transporter, AroP APC transporter, LplT-aas, FadK, YeaY, EamB, YhgE and YicG membrane proteins. These loci are probably involved in diversifying selection since they code for antigenic proteins exposed at the cell surface. Second are two regions encompassing mismatch repair genes (mutS and mutH) that have been shown to be under selection for cycles of inactivation and reacquisition through recombination [70]. Third is a region associated with the integration of a locus that can provide resistance to phages through clustered regulatory interspaced short palindromic repeats (CRISPRs) [71]. All available methods estimate the effective, not the intrinsic, recombination rate. Effective recombination results from the intrinsic recombination rate and the ensuing selection on recombinants. Most of the phylogenetic incongruence hotspots we found contain genes under diversifying selection, for instance to escape immune pressure or to acquire resistance to phage. Hence, it is very likely that differences in the intensity of selection might be responsible for the differences observed in the size of the regions affected by a phylogenetic incongruence hotspot. A recombinant carrying a new allele at a locus under strong diversifying selection will be selected and thus will increase rapidly in frequency in populations. Hence, the recombinant will invade the local population before any further recombination occurs at the locus [72]. In that case, sampling the genome after the action of natural selection allows identification of the original recombining fragment. In contrast, if selection is moderate, the recombining fragment that brought the interesting allele into the genome will be covered by many further recombination events before it reaches high frequency. In this case, only fragments around the selected allele will retain the trace of the recombination event. As a consequence, when selection is intense, one expects to identify long recombinant fragments in some strains, as we did at the rfb or leuX loci. Our observations suggest that the intensity of diversifying selection acting on the rfb and leuX-fimH loci are under very strong selective pressure compared with the diversifying selection acting on the core LPS, the flagella or some of the porins. The fact that most hotspots of integration (117 among 133) do not result in hotspots of phylogenetic incongruence suggests that they carry neutral or deleterious genes. Conversely, it also suggests that some horizontally acquired genes can be highly beneficial (e.g., 11 hotspots of phylogenetic incongruence around the rfb or leuX-fimH locus) or moderately beneficial (e.g., 4 hotspots of integration associated with hotspots of phylogenetic incongruence) and that this results in different selection footprints in the neighbouring core genome. Recombination and Chromosome Organisation The existence of integration and phylogenetic incongruence hotspots brings to the fore the conflict between genome dynamics and organisation. We therefore analysed the variation in recombination along the backbone sequence (as estimated by a population genetic-based approach), using a sliding window of 3 kbp on the multiple genome alignment and a step size of 500 bp. This analysis revealed a large region around the terminus of replication with a particularly low ratio between gene conversion and mutation rates (Cgc/theta) (Figure 10). The region between 1 Mb and 2 Mb shows lower gene conversion rates, since there is 20% lower chance for a base to be involved in a gene conversion event (Cgc×Lgc, unilateral t test: p = 1e-21). This region also shows 10% lower levels of polymorphism (theta of Watterson, p = 1e-7), i.e., variations within the E. coli species, and 2% lower G+C content (Figure 10). A+T richness at the terminus region has been suggested to result from higher mutation rates [73]. Based on comparative genomics with Salmonella, it was also shown that divergence, i.e., the genetic distance between species, slightly increased closer to the terminus [74],[75], further supporting the hypothesis of a higher local mutation rate. Using our newly sequenced outgroup genome E. fergusonii, which unlike Salmonella does not shows saturation of synonymous substitutions, we found that the terminus domain has synonymous and non-synonymous substitution rates twice as high as the rest of the chromosome. While decreased G+C content and increased divergence could reflect a higher mutation rate at the terminus, such an interpretation is contradicted by the observed lower polymorphism. 10.1371/journal.pgen.1000344.g010 Figure 10 Standardized cumulative sum of effective gene conversion rate and G+C content. Gene conversion rate (i.e., probability of being involved in a gene conversion event Cgc.Lgc) is shown in blue, and G+C content in red. A decrease in the cumulative sum reflects regions of lower-than-expected values of the statistics. Around the terminus domain, we found a decrease in both recombination and G+C content. Coloured boxes represent the 4 different organisation macrodomains (Right, Ter, Left, Ori). The arrows point towards the origin and terminus of replication. Theoretical population genetic studies have shown that the fluctuation of recombination frequency along chromosomes affects the level of polymorphism and the efficiency of selection [76]. When there are numerous deleterious mutations and low recombination rates, a fraction of the population bearing deleterious alleles is doomed to disappear in the long term without contributing to the gene pool of the future population. The relevance of this phenomenon, referred to as background selection, requires the existence of deleterious mutations of moderate effects, i.e., mutations that can persist for some time in the population before selection wipes them out. At the population level, this result in an excess of rare alleles, which can be estimated by Tajima's D statistics. We found that overall a gene's average Tajima's D was slightly negative (indicating an excess of rare alleles). However, the Tajima's D of synonymous mutations was null, while that of non-synonymous mutations was much more negative (Figure S9). This suggests that most non-synonymous mutations are deleterious since, in contrast to synonymous mutations, they do not increase in frequency within the population reflecting the purging effect of natural selection. Therefore the conditions for the action of background selection are met. Furthermore, under background selection, a reduced recombination rate results in a decreased polymorphism (such as we observed around the terminus), an increased fraction of rare alleles and a decreased efficiency of selection [76]. The terminus region shows a lower Tajima's D than the rest of the chromosome (Student bilateral test, p 8 mice killed) [32]. Supporting Information Figure S1 Circular representation of the six Escherichia coli genomes and the E. fergusonii genome. Circles display from the inside out: (1) GC skew (G+C/G−C using a 1 kbp sliding window). (2) Location of tRNA genes, rRNA operons and Insertion Sequences (ISs). (3) GC deviation (difference between the mean GC content in a 1 kbp window and the overall mean GC). Red areas indicate that the deviation is greater than 2 standard deviations. (4) Ancestral E. coli genome. Yellow areas denote genes that are present in all the genomes under study. (5) Scale. (6) Gene specificity at strain level. Genes sharing at least one homolog in an other E. coli strain of the same phylogenetic group and having more than 85% identity over at least 80% of their length were regarded as non specific. To simplify the visualisation of specific regions, we created a colour gradient that denotes the percentage of organisms that possess a homolog of a given gene within the reference genome. If this particular gene is present in all the organisms under study, it is tagged in light grey. Conversely, if it is present only in the reference genome, it is tagged in dark colour. In other words, the more pronounced the colour, the higher the specificity. (7) Gene specificity at group level. The same criteria were used as for circle (6) but the genome analysed is compared to E. coli strains that belong to other phylogenetic groups. The comparison includes Shigella as well. (8) Gene specificity at the species level. The same protocol was used as for circles (6) and (7) except the comparison involves E. fergusonii which is considered as the outgroup for this study. (1.82 MB PPT) Click here for additional data file. Figure S2 Visual representation of MAUVE multiple alignment of 20 Escherichia coli/Shigella genomes. The representation was performed using the MOSAIC database (http://genome.jouy.inra.fr/mosaic/) multiple alignment viewer. Horizontal lines correspond to a linear representation of each genome sequence drawn to scale. The blue line corresponds to annotated genes. (At this scale only a unique line is visible.) Coloured blocks correspond to the locally collinear blocks (LCBs) of the alignment as defined by MAUVE. LCBs corresponding to inversions are represented on a second line. An LCB in one genome is linked to the corresponding LCB on the subsequent genome with a plot of the same colour. This visual representation shows that, apart from the rearrangements present in Shigella chromosomes, E. coli genomes are mostly collinear. (0.11 MB PPT) Click here for additional data file. Figure S3 Phylogenetic tree of the backbone of the 20 Escherichia coli and Shigella strains as reconstructed by MAUVE software. This unrooted tree was built using Tree-puzzle with the HKY+gamma (with 8 categories)+I model followed by BioNJ to reconstruct the tree from the distance matrix. The values at the nodes correspond to support values for each internal branch, as estimated by Tree-puzzle (range 0–100), and can be interpreted in much the same way as bootstrap values. (0.09 MB PPT) Click here for additional data file. Figure S4 Association between gene repertoire relatedness and phylogenetic distance. A. Genomes were binned according to phylogenetic distance for clarity. For the first two bins, which correspond to the most closely related genes, there is a high percentage of genome in common, which is not the case for the other bins, which correspond with more distantly related genes. B. Histogram of the phylogenetic distances between pairs of genomes. (0.04 MB PPT) Click here for additional data file. Figure S5 Reconstruction of gains and losses of genes in the evolution of Escherichia coli. The cladogram shows the phylogenetic relationships between the 20 E. coli/Shigella genomes rooted on the E. fergusonii genome, as in Figure 4, with branch lengths ignored for clarity. Each strain and internal node of the tree is labelled with the inferred numbers of genes gained (red: top) and lost (black: top), and the inferred numbers of corresponding events of gene acquisition (red: bottom) and loss (black: bottom) along the branch. Pie charts on each branch indicate the functional classification of genes lost, using the colour-scale (details in the keys). The functional classes of known-function genes are represented by numbers explained by a key in Supplementary Table 4. (0.31 MB PPT) Click here for additional data file. Figure S6 Association between the distance of a node to the tip of the tree and the difference between the predicted ancestral genome size and the effective number of genes reliably predicted to be present in the node. The association is highly significant (R2 = 0.56, p<0.001). (0.03 MB PPT) Click here for additional data file. Figure S7 Characteristics of hotspots of insertion/deletion of genetic material. The circles indicate the values per location between contiguous genes in the core genome. Data are (from the outside circle inwards): average number of genes, standard deviation, sum of genes, number of prophage-like elements, number of insertion sequence like elements, sum of tRNA genes and heterogeneity rate at hotspots. The latter is the ratio between the observed number of orthologs and the expected value if all genes had orthologs in all genomes, after excluding genomes lacking genes at the hotspot and those for which the region has a synteny breakpoint. (0.14 MB PPT) Click here for additional data file. Figure S8 Phylogenetic congruence at the rfb locus coding for O antigen. We followed the likelihood of the species topology for 5 kbp windows (spaced by 250 bp) along the chromosomal backbone. After correcting for the number of polymorphic sites, each window received a Z score of phylogenetic congruence. Low values reflect lower than average phylogenetic congruence. A large region (green arrow) has a significantly lower congruence than the rest of the genome. The red arrows indicate the hotspots of integration and the corresponding loci when identified. HPI: high pathogenicity island. (0.03 MB PPT) Click here for additional data file. Figure S9 Distribution of Tajima's D statistics on the 1976 Escherichia coli core genome genes. The colour code is as follows: all mutations (red), synonymous mutations (green) and non-synonymous mutations (yellow). Negative Tajima's D values [126] reflect a higher than expected frequency of rare alleles. The more negative value of Tajima's D for non-synonymous mutations suggests that they are on average deleterious: they persist some time in populations before selection removes them. (0.08 MB PPT) Click here for additional data file. Table S1 Pseudogenes found in Escherichia coli K-12 MG1655 and the 7 newly sequenced genomes of the ColiScope project. The ‘Reference’ column give the coding sequence (CDS) identifier of the wild type form of the gene, in one of the 21 analyzed Escherichia and Shigella bacterial genomes. For each of these 21 genomes the particular gene's status is indicated as functional (‘1’), absent (‘0’) or a pseudogene (‘−1’). Gene names in boldface correspond to genes that are pseudogenes only in the considered strain. (0.25 MB XLS) Click here for additional data file. Table S2 A) Number of predicted protein encoding genes in the genomes of the newly sequenced strains of Escherichia coli and E. fergusonii. Genes were (a) functionality annotated using automatic annotation transfer from K-12 MG1655 orthologs or other ColiScope manually annotated orthologous genes, (b) manually annotated using the MaGe web-based graphical interface, or (c) considered as false positive gene predictions. B) Publicly available Escherichia and Shigella genomes included in the ColiScope database. (a) Inaccurate (‘Wrong’ status) or missed gene annotations (‘New’ status) have been found using our MICheck procedure. For the 14 analyzed genomes, the list of newly predicted genes is given in Supplementary Table 3. (b) Automatic functional annotation transfer between orthologous genes (85% identity over at least 80% of the length of the smallest protein) began with similarity results obtained with E. coli K-12 MG1655, then with the new genomes of the ColiScope project. False gene predictions (i.e., artefacts) were those defined in the course of the expert annotation of the ColiScope sequences. (c) ‘Specific genes’ are genes that have no ortholog in E. coli K-12 MG1655 or any of the newly sequenced and annotated genomes. (0.05 MB DOC) Click here for additional data file. Table S3 Missing genes in publicly available Escherichia coli and Shigella genomes. The genes are ordered by length (given in base pairs). Those that are similar to genes from the minimal gene set defined by [127] are highlighted in boldface. Functional descriptions of the genes, starting with ‘fragment of’ (Product column), are provided for putative pseudogenes (whether actual pseudogenes or sequencing errors). For some E. coli stains, e.g., UTI89, the corresponding pseudogenes were probably correctly annotated by the authors (numbering of the gene locus_tags), but were not reported in the databank files (GeneBank and EMsBL), and thus were annotated as missing genes by the MICheck procedure [97]. (0.36 MB XLS) Click here for additional data file. Table S4 Synteny blocks and insertion sequence (IS) elements among the 21 Escherichia coli/Shigella/E. fergusonii genomes. (0.03 MB DOC) Click here for additional data file. Table S5 Classification of bioprocesses (key for Figure 7). The tests indicate the sense of the difference in the number of genes associated with a given bioprocess. ‘+/−’ means more/fewer genes in the first class, i.e., more/fewer in the core genome than in the complementary set. ‘++/−−’ means the difference is significant at the 5% level, using a Chi square test followed by a sequential Bonferroni correction for multiple tests. (0.02 MB DOC) Click here for additional data file. Table S6 Genes (and associated characteristics) categorically associated with certain phylogenetic groups or pathotypes. Principal characteristics of the genes were deduced from the annotation process. (0.23 MB DOC) Click here for additional data file. Table S7 Genes of the genomic island at the pheV tRNA insertion hot spot (see Figure 9). (0.08 MB XLS) Click here for additional data file.
              Bookmark
              • Record: found
              • Abstract: found
              • Article: not found

              A Guild of 45 CRISPR-Associated (Cas) Protein Families and Multiple CRISPR/Cas Subtypes Exist in Prokaryotic Genomes

              Introduction Clusters of short DNA repeats with nonhomologous spacers, which are found at regular intervals in the genomes of phylogenetically distinct prokaryotic species, comprise a family with recognizable features [1–3]. These repeats were first observed by Ishino and colleagues [4] upstream of the iap gene in Escherichia coli and later in other bacterial and archaeal species such as Haloferax mediterranei, Streptococcus pyogenes, and Mycobacterium tuberculosis. Each member of this family of repeats was designated differently by the original authors, leading to a confusing nomenclature: SPIDR (spacers interspaced direct repeats) by Jansen and colleagues [5], SRSR (short regularly spaced repeats) by Mojica and colleagues [2], and LCTR (large cluster of 20-nt tandem repeat sequences) by She and colleagues [6], among others. Based on a systematic characterization in 40 different bacterial and archaeal genomes, Jansen and colleagues [3] proposed, in agreement with Mojica's research group, a new nomenclature for this family of DNA repeats, which are now known as clustered regularly interspaced short palindromic repeats (CRISPRs). Sequencing of the genome of the archaeon Methanococcus (now Methanocaldococcus) jannaschii [7] led to the first detailed characterization of these repeats in a complete genome, where 18 loci were found, most featuring a single copy of a long repeat (LR) or leader sequence and a variable number of regularly spaced short repeats (SRs). In M. jannaschii, two of the LRs were truncated, each ending with a single SR, and one cluster of five SRs had no LR. Similar repeats were identified in the genome of the hyperthermophilic bacterium Thermotoga maritima [8]. The association of these repeats with nearby gene clusters that showed closest similarity to archaeal species and their atypical DNA composition (as measured by χ2 analysis) were called consistent with other observations as evidence of lateral gene transfer (LGT) between archaeal and bacterial species [8]. Together, these findings suggested transfer of repeat-associated DNA within and between prokaryotic genomes. Four genes, designated CRISPR-associated (cas), are found only in species containing CRISPR, always located near to a repeat locus, and usually oriented head-to-tail as if cotranscribed [3]. The most common arrangement of these genes is cas3–cas4–cas1–cas2. The Cas3 protein appears to be a helicase, whereas Cas4 resembles the RecB family of exonucleases and contains a cysteine-rich motif, suggestive of DNA binding. Cas1 is generally highly basic and is the only Cas protein found consistently in all species that contain CRISPR loci. Cas2 remains to be characterized. None of the other genes in the vicinity of these four cas genes were found to represent protein families specifically associated with CRISPR. There has been only limited biological characterization of these elements. Efforts to increase the copy numbers of these repeats in the archaeon Haloferax volcanii resulted in altered segregation and reduced viability of the cells [1]; a role in replicon partitioning was suggested. Supporting this, small clusters of repeats are found in two self-transmissible plasmids of the genus Sulfolobus; these plasmids appear more stably maintained than plasmids lacking repeats [9]. The main chromosome of Sulfolobus solfataricus, unlike the plasmids themselves, has both cas gene clusters next to repeats [3] and a genus-specific binding protein (SSO0454) for its own and for the plasmids' repeats [10]. Tang and coworkers [11] found 22 small, nonmessenger RNAs transcribed from CRISPR arrays of the archaeon Archaeoglobus fulgidus, nearly all had sizes just below a repeat plus a spacer with the 3′ end in the middle of the repeat; repeat arrays of S. solfataricus also were transcribed into RNA and processed. Two recent analyses of spacer elements found between individual CRISPRs demonstrate that most have no conclusive origin by sequence similarity; those that do, strikingly, match phage or other types of selfish genetic elements [12,13]. Despite these advances, the functions of both CRISPR and Cas proteins remain unknown. In this study, we present a detailed analysis of four previously defined and 41 newly identified Cas proteins in the microbial species for which we have complete genome sequences. Results Identification of New CRISPR-Associated (Cas) Protein Families In addition to the previously described cas genes (cas1–4), we have detected a number of protein families whose members are found in the vicinity of CRISPRs across multiple prokaryotic species. Hidden Markov models (HMMs) for these families have been constructed and deposited in the TIGRFAMs database (http://www.tigr.org/TIGRFAMs; Table 1). In the present work, the CRISPR-associated protein families are described as a “guild,” i.e., a collection of members that perform somewhat similar work. The guild is presumed to be involved in processes that may include the maintenance of repeat clusters [3], capture of new spacer elements [12,13] and expansion or contraction of clusters, propagation of the leader sequence and repeat clusters within a genome [3,7], transfer of CRISPR and cas genes together to new genomes [8,14,15], and interaction of CRISPR/cas loci with the host cell (see Discussion). From our study, a total of 53 HMMs have been constructed that represent at least 45 different protein families (including Cas1–4). The discrepancy between the number of HMMs and protein families results from two pairs of models for the Cas2 protein and the Cas3 protein (Table 1), which have enough sequence divergence that a single model is not sufficient. Also included is a model for the HD domain of Cas3, which in some cases is a separate polypeptide and in others is absent. Finally, in addition to a model that detects the diagnostic domain of Cas5 (see below), we present five narrower models that detect the five subtype-specific full-length variants of this family (Table 1). Many of these families contain members that belong to clusters of orthologous groups (COGs) [14,16], although the relationship between the HMMs described here and these COGs is imperfect (see Discussion). The functions of these protein families are largely unknown, although distant homologies to characterized proteins, motifs, and domains have been noted in the present study and in previous analyses [14] (Table 1). For example, eight families of CRISPR-associated proteins (Csm3–5, Cmr1, Cmr3, Cmr4, Cmr6, and Csx7) all belong to the repair-associated mysterious protein (RAMP) superfamily [14] as detected by Pfam [17] model PF03787. These RAMP families appear to act in concert since sets of them typically are found in gene clusters (Figure 1). The assignment of genes to these 45 families has allowed for an analysis of the genomic context in which they are typically found. Three basic types of family context have emerged. First, the “core” cas genes (i.e., cas1–4) are found in a wide range of contexts with respect to the other gene families, whose genes are clustered nearby. Second, subtype-specific genes are found in association with the core genes and others of the same subtype, often with conserved gene order. Finally, modular genes, associated with one another in particular combinations, are always found in genomes containing the core genes, but may be found at distant sites from those clusters. Based on the observation of such contextual patterns, we have defined two additional core cas genes (cas5 and cas6), eight CRISPR-associated subtypes, and one CRISPR-associated module each of which is described in detail below and presented in Table 1. Each of the subtypes has been named for a genome in which it appears as the only CRISPR locus (e.g., CRISPR subtype Apern after Aeropyrum pernix; see Table 1, Figure 1), and the associated subtype-specific genes have been assigned gene symbols based on a three-letter prefix and a numeral suffix (i.e., csa1) following the cas1–4 model (Table 1, Figure 1). The module (CRISPR RAMP module, cmr1–6) has been named after the RAMP superfamily since four of the six genes appear to be members of this superfamily. Models for a number of families have been constructed for which no contextual pattern has yet been defined, most likely due to an insufficient number of genomes harboring the gene. These have been assigned gene symbols with the prefix “csx” (Table 2). Our assignments of genes to CRISPR-associated families has allowed for the identification of CRISPR/cas loci that span the genomic distance between CRISPR arrays not previously appreciated as forming the same locus (e.g., Bacillus halodurans C-125 and Aquifex aeolicus VF5; see Figure 1). Indeed, it appears to be a rule that virtually every gene found between any two cas genes is strictly CRISPR-associated itself, although it may not be as common as the core cas genes, cas1–cas6. The exceptions are putative transposases, restriction enzymes, and addiction module proteins, or hypothetical proteins with few or no homologs; several examples appear in Figure 1. Frequently we have found that the addition of new genomes to our databases shows such hypothetical proteins to belong to a new family of cas genes. Through this process of slowly building up our library of cas gene families, the patterns of conserved subtypes, previously obscure, has come into sharp focus (Figure 1). New Core cas Gene Families: cas5 and cas6 The cas core genes (cas1–4) were originally delineated by Jansen and colleagues [3] and are characterized by their close proximity to the CRISPR loci and their broad distribution across bacterial and archaeal species. Although not all cas core genes associate with all CRISPR loci, they are all found in multiple subtypes (Table 3). We have observed a 43-amino acid N-terminal domain, which appears in a single protein in each of five separate CRISPR/Cas subtypes and in a number of currently untyped loci. We designate these families collectively as Cas5. Members average 250 amino acids in length; regions outside the N-terminal domain form subtype-specific families with remote to undetectable homology across subtypes. For this reason, we have included both subtype-specific full-length models and the domain model in the TIGRFAMs library (see Table 1) and have assigned gene symbols with a trailing letter to indicate the subtype variant in question (e.g., cas5e, for the Ecoli subtype variant). The Cas5 domain is found in the M. xanthus DK1622 DevS protein, which has been implicated in a species-specific developmental pathway [18,19], although found within an apparent CRISPR/cas locus of a novel subtype. There may be a distant homology relationship between Cas5 proteins and the RAMP superfamily, though not detected by the Pfam [17] RAMP superfamily model PF03787 (data not shown). The Cas6 family includes proteins averaging 140 amino acids in length that share a strong homology at the C-terminus, including a GhGxxxxxGhG motif (“h” indicates a hydrophobic amino acid). The cas6 gene is found in association with four separate CRISPR/Cas subtypes (see Table 3) and has a preferred location as the cas gene most distal to the CRISPR (Figure 1). Description of the Different CRISPR/Cas Subtypes Tables 1, 3, and 4 and Figure 1 delineate the essential features of the eight CRISPR/Cas subtypes that we have defined thus far, including the subtype-specific (see Table 1) and core (see Table 3) genes involved, the length and periodicity of the repeats and the length distributions of the associated spacers (Table 3), the co-occurrences and subtype fusions observed (see Table 4), and the species in which they are found (Figure 1). Distinctive and notable features of these subtypes will be discussed below. Each subtype is named for the species of a genome sequence in which only that subtype is found. CRISPR/Cas Subtype Ecoli (Based on Escherichia coli K12-MG1655) The Ecoli subtype features five subtype-specific genes and cas1–3. The cas2 gene associated with this subtype is sufficiently diverged from the rest of the Cas2 family so that the construction of a separate HMM (TIGR01873) was necessary. The 61-bp average (see Table 3) periodicity is unique among the subtypes we describe, and we never find an Ecoli-type cluster fused to another type. This subtype is sporadically distributed among bacteria and not found in any of the sequenced Archaea present in the Comprehensive Microbial Resource (http://www.tigr.org/tigr-scripts/CMR2/CMRHomePage.spl) [20]. Saunders and coworkers [21] report a cluster of “bacterial-specific” CRISPR-associated genes in the incomplete genomic sequence of the cold-adapted archaeon Methanococcoides burtonii that prove to be cas3, cse1, cse2, and cse4; we detect a second cluster in this organism with the remaining four required genes. CRISPR/Cas Subtype Ypest (Based on Various Yersinia pestis Strains) The Ypest subtype is unique in its lack of a Cas2 homolog. It has the shortest average repeat periodicity (only 60 bp; Table 3), the most well-conserved repeat sequence from one species to another, and easily the narrowest phylogenetic range. It is observed only in several Gammaproteobacteria sp. and one Betaproteobacterium. The Cas3 putative helicase associated with this subtype is sufficiently diverged from the rest of the Cas3 family that the construction of a separate HMM (TIGR02562) was necessary. It is the spacers of the Ypest-subtype repeats in Y. pestis, which were analyzed by Pourcel and colleagues [13]. The cas1 gene associated with these clusters is most closely related to that of the Ecoli subtype (Figure 2), which has the next shortest repeat periodicity. CRISPR/Cas Subtype Nmeni (Based on Neisseria meningitidis Serogroup A Z2491) This subtype is the most abbreviated that we have described, being the only one lacking cas3, cas4, and cas5 and having the shortest average spacer lengths observed (30 bp; Table 3). Jansen and colleagues [3] noted similarity between Cas4 and RecB family exonucleases. Members of the Csn1 family, by contrast, contain an McrA nuclease-like domain [14]. Csn1, a large and likely multidomain protein, may perform the functions of the absent Cas4 and potentially Cas3 as well. A second subtype-specific gene, csn2, is present in some but not all Nmeni CRISPR/cas loci. A characteristic feature of this subtype is a single copy of the repeat (sometimes direct, sometimes inverted), which appears upstream of the first gene in the locus, in addition to the repeat cluster downstream of the last gene. Notably, species bearing this CRISPR/Cas subtype are, without exception, vertebrate pathogens and commensals. “Three-Gene” CRISPR/Cas Subtypes: Dvulg (Based on Desulfovibrio vulgaris Hildenborough), Tneap (Based on Thermotoga neapolitana DSM4359), and Hmari (Based on Haloarcula marismortui ATCC 43049) A number of subtypes appear to have a similar overall structure consisting of cas1–4 (and, with the exception of Dvulg [see below] of cas6 as well) and three subtype-specific genes, including the subtype-specific cas5 variants (see Figure 1). Typically, one of these shows homology to the DevR protein, which has been characterized as a negative autoregulator and is the presumed partner of DevS/Cas5 from M. xanthus DK1622 [22], while the last is a large (400–700 amino acids) protein, often containing a CXXC–CXXC motif. These three genes are always adjacent to cas3. Each of these types generally is associated with repeat spacers of a distinct average length (Dvulg, 34; Hmari, 36; and Tneap, 37; Table 3), and all have cas1 genes that are more closely related to one another than to the cas1 genes of other subtypes (see Figure 2). Several genomes contain CRISPR/cas loci that also appear to conform to this structural class, but the number of sequenced genomes containing homologs is currently insufficient to create subtype-specific HMMs (e.g., CRISPR/cas loci of Leptospira interrogans serovar Lai strain 56601 and Fusobacterium nucleatum ATCC 25586). In addition to the repeat cluster immediately downstream of the cas gene operon, CRISPR/cas loci of the Tneap and Hmari types, but not the Dvulg type, frequently have another cluster immediately upstream. CRISPR/Cas Subtype Apern (Based on A. pernix K1) This subtype is found only in Archaea and comprises the only described subtype in the Crenarchaeota (although the RAMP module is also observed; see below). Although this subtype is only found in thermophilic species, significance of this correlation is tempered by the fact that the large majority of archaeal species are thermophiles. In Sulfolobus sp. and A. fulgidus DSM4304, csa4 is absent, while csa5, a gene not found in A. pernix, is present, although they have no detectable homology. In similar fashion to the three-gene class of subtypes (see above), csa2 is a distant homolog of devR. The cas1 genes of this subtype are most closely related to those of the three-gene subtypes; indeed, the cas1 gene of M. jannaschii DSM2661 would appear by homology to be a Tneap type and may represent an instance of subtype recombination (Figure 2). CRISPR/Cas Subtype Mtube (Based on M. tuberculosis Strains CDC1551 and H37Rv) Although observed as the sole subtype in several genomes, this subtype is more commonly found in genomes containing other subtypes at remote sites and in hybrid, fused loci (e.g., T. maritima; see Figure 1). The subtypes with which Mtube have been observed to associate are Tneap, Hmari, and Apern (Figure 1). The repeats proximal to the Mtube genes in unfused loci have long, but variable average periodicities and spacers (Figure 3). When found in fused loci, the csm genes tend to be distal from cas1 and cas2, which are themselves proximal to the subtype-specific genes of the other subtype in the locus. Additionally, in these hybrid loci, the spacer length is typical of the subtype partner, not of the Mtube subtype. Occasionally, as is observed in Methanosarcina acetivorans C2A, an Mtube locus with a robust repeat array is found lacking all cas core genes, although they are found elsewhere in the genome in association with a CRISPR/cas locus of another subtype. This would suggest that the linkage between the core cas genes and the subtype-specific genes is weaker in this subtype and is somewhat akin to the behavior of the RAMP module (see below). The notable variability of the spacer lengths may also indicate a heterogeneity of origin for these repeat arrays. When the cas1 genes that are proximate to csm genes are incorporated into a tree such as the one displayed in Figure 2, they do not form a single clade, but are found in clades that include cas1 genes associated with those subtypes with which Mtube typically associates (data not shown). Additionally, three of the five subtype-specific genes (csm3–5) are members of the RAMP superfamily, and csm1 is a homolog of cmr2. CRISPR/Cas RAMP Module CRISPR/Cas systems include a six-gene module that appears to occur only in genomes that contain other CRISPR systems, whether or not those systems are nearby. The cluster of genes cmr1–6 is observed in a wide range of archaeal and bacterial genomes, but always in the same genome with other subtypes, and most often fused into hybrid loci (see Figure 1; Table 1). Four of the six genes in this module are members of the RAMP superfamily [14]. This RAMP module associates with the three-gene class (Dvulg, Tneap, and Hmari), as well as Apern and Mtube subtypes. We have observed one instance where a RAMP module is found in the same genome with an Ecoli type (in Thermus thermophilus HB8), but in this case the RAMP is actually part of a hybrid locus with an Mtube subtype. Degraded and Atypical CRISPR/cas Loci Expansion of the number of CRISPR families from four to 45, definition of CRISPR/Cas subtypes, and examination of over 200 genomes allow reexamination of cas pseudogenes and degenerate CRISPR/cas loci. The genome of Thermoplasma volcanium, e.g., should be viewed not as having lost much of the system present in many other Archaea [14], but rather as having a CRISPR/cas locus of the Mtube type, while many other Archaea have loci of the Apern type (Figure 1). We believe we have observed evidence of ongoing processes of CRISPR locus degradation in multiple independent cases in five separate subtypes (Ecoli, Ypest, Nmeni, Dvulg, and Mtube) in bacteria. Examples include both cas1 and cas2 pseudogenes adjacent to apparently intact subtype-specific genes (allowing subtype identification) and loci in which novel subtype-specific genes are degenerate. A dramatic case of degradation is found in the genome of Coxiella burnetii RSA 493, in which the Ypest locus contains frameshifts and truncations in four of the six genes, and the repeat array consists of only a single exact copy of the Ypest repeat. The Ypest repeat is so well conserved that it can be used to search for genomes in which it is the only remaining trace of a Ypest CRISPR locus. BLAST searches of a consensus sequence created from alignments of the repeat can detect instances with up to four mismatches. We have detected tiny arrays of one or two full-length copies and one additional partial repeat (with the expected spacing of exactly 60 nt) only in several strains of E. coli, Shigella flexneri, and Shewanella oneidensis (all of which are within the phylogenetic range of the Ypest subtype generally). Atypical CRISPR systems do occur, such as the previously overlooked repeat array in the genome of Thermoplasma acidophilum, where no cas genes are found, and the Ypest system of Zymomonas mobilis ZM4, where the cas gene cluster is far from the characteristic 28-bp repeat cluster. Large distances between cas gene clusters and their closest repeat clusters occur less often than cas pseudogenes adjacent to degenerate repeats. Excluding these rare exceptions, the average distance from the cas gene cluster to the nearest repeat cluster is well below 1,000 bp and varies according to CRISPR/Cas subtype, e.g., 180 bp for Dvulg, 232 bp for Ypest, and 414 bp for Apern. This spacing often accommodates the CRISPR leader sequence. Molecular Phylogeny of Cas Core Proteins The definition of the subtypes discussed above was driven by the observation of the conserved contexts of families of distinct genes. As has been mentioned above, the cas core genes are found across various subtypes, most definitively in the case of cas1, which appears to be nearly universal for CRISPR/cas loci. The molecular phylogeny of various Cas core proteins has been explored by the construction of multiple sequence alignments, restriction of those alignments to well-aligned regions, and the calculation of neighbor-joining trees. A representative tree for Cas1 is shown in Figure 2. Trees for other Cas core proteins showed largely the same pattern, although limited to the subtypes in which they are individually found. These trees were robust, showing insignificant differences in branching patterns when a variety of alignment regions and tree-building algorithms were used. The clustering of the Cas core proteins broadly recapitulates the subtype divisions that were defined independently of this information. There would appear to be a limited number of cases where the Cas core proteins do not share the same evolutionary history as their associated subtype-specific proteins. Discussion CRISPR is a widely distributed family of repeats in prokaryotes [1–3,5,7,15]. Preliminary insight into their biology came with the discovery that four different protein families occur in prokaryotes only if CRISPRs are present. These proteins are always near a set of these repeats and always include Cas1 [3]. In the current study, we built on these prior findings and established a number of HMM-defined Cas protein families. These protein families have been found to form conserved clusters across multiple genomes, which allowed us to create rules for the identification of specific subtypes of CRISPR/Cas system. From the study presented here, it is apparent that CRISPR/Cas systems are far more complex than previously appreciated. Forty-five distinct protein families associated with CRISPRs have been identified among the first 200 completed prokaryotic genomes. These are currently represented by 53 HMMs (Tables 1 and 2). These models are sensitive, in that they unambiguously identify genes, and are also selective, in that they do not identify genes in organisms lacking CRISPR/cas loci. The subtype-specific models accurately discriminate between the subtypes but may, infrequently, identify genes in novel CRISPR/cas contexts that, given sufficient additional genomes, would warrant the status of separate subtypes. Previous work by Makarova and colleagues [14], conducted on a smaller set of available microbial genomes and without the knowledge of the associated CRISPRs, resulted in the identification of some 20 gene families (COGs) proposed to act in DNA repair, many of which contain genes identified by our HMM models. The relationship between these two sets of families is uneven, with some of our HMMs spanning multiple COGs, some COGs spanning multiple HMMs, and some COGs including genes we believe unrelated to CRISPRs. COG0640, e.g., includes eight putative transcriptional regulators in A. fulgidus and five in M. jannaschii, but only MJ0379 and AF1869, one locus in each species, are CRISPR-associated; they encode the Csa3 protein of the Apern-type CRISPR system. These differences are not unexpected, considering the different clustering methods and search algorithms applied to unequal datasets in this case. Their work also introduced the RAMP superfamily[14], to which a number of Cas protein families belong. The proposed helicase, nuclease, and other domains for DNA repair metabolism may instead or in addition act in the processes of CRISPR physiology: mobilization, maintenance, processing, and addition of new spacer elements. To reflect this change in interpretation, we propose renaming the RAMP superfamily from repair-associated to repeat-associated mysterious protein, thus preserving the acronym currently in use. The groups of gene families that comprise the CRISPR/Cas subtypes appear to have traveled together through evolutionary time as discrete units. Even the core cas genes appear to have the same evolutionary history as their partner subtype-specific genes (Figure 2). The reasonable hypothesis that the Cas proteins interact (i.e., bind to, stabilize, regulate the expression of, cleave, modify, degrade, etc.) with the repeats in their DNA or expressed RNA form is supported by the observation of subtype-specific characteristics of the repeats such as repeat periodicity. Although as demonstrated in this study, CRISPR/cas loci of different subtypes can coexist within the same genome, phylogenetic reconstructions of Cas core proteins do not provide any evidence of switching between subtypes having repeat periodicities of 60, 61, and those with longer periodicities (Figure 2). The RAMP module and the RAMP-like Mtube subtype would appear to deviate from this pattern, showing varying degrees of independence from dedicated cas core genes and their associated repeat periodicities. It has been previously suggested that cas genes have undergone LGT events based on phylogenetic analyses and conservation of gene order [14], anomalous nucleotide frequencies [8], and the presence of multiple chromosomal CRISPR loci [3]. Our finding that the core cas genes belong to multiple CRISPR subtypes, each with its own sporadic distribution, indicates that this conclusion should be reexamined and reconfirmed. Indeed we have observed several lines of evidence that support the LGT hypothesis: (1) CRISPR/cas loci representing five different subtypes are found on plasmids (subtype Ypest in Legionella pneumophila Lens, subtype Dvulg in D. vulgaris, subtype Hmari in H. marismortui, subtype Ecoli in Photobacterium profundum, and both subtypes Mtube and Ecoli in T. thermophilus HB8). In the case of L. pneumophila Lens, a second, nearly identical copy of the locus is found on the chromosome. (2) In L. pneumophila Paris, by contrast, there is no trace of any gene with homology to the Ypest subtype genes or repeats found in the Lens strain, while an entirely different (untyped) CRISPR locus is found. Differences in CRISPR locus content have been observed between closely related strains of S. pyogenes, Listeria monocytogenes, and T. thermophilus. (3) Comparison of the Ecoli subtype loci from E. coli K12-MG1655 and E. coli O157:H7 EDL933 shows that while Cas1, Cas2, and the surrounding genomic region are nearly identical between K12-MG1655 and O157:H7 EDL933, this similarity does not extend to the rest of the Cas proteins in the cluster. For K12-MG1655, these proteins are most similar to those in Geobacter sulfurreducens, while for O157:H7 EDL933 they are most similar to those of Photorhabdus luminescens. (4) Cas1 proteins found in Porphyromonas gingivalis W83, Vibrio vulnificus YJ016 and Nostoc sp. PCC 7120 are fusion proteins, having a C-terminal Cas1 domain but also a reverse transcriptase domain similar to that found in group II introns. This may represent one mechanism used for mobilization in a subset of CRISPR loci. Clusters of cas genes and their associated repeats must maintain themselves in prokaryotic populations by reproducing and mobilizing themselves as fast as they are degraded. We see numbers of degenerate CRISPR/cas systems as well as profound differences in cas gene content between closely related species or strains. This is significant, because it implies that the process of replenishing genomes with intact CRISPR loci is frequent. We are inclined to believe that CRISPR/cas loci may, under certain circumstances, confer selective advantages to their host cells and, in these cases, stabilize the loci against degradation. We have yet to observe a single instance of a duplicated cas gene cluster on the chromosome(s) of any species. This is in contrast to selfish genetic elements such as transposons, which persist in a given lineage largely through redundancy. Plasticity with respect to the number of repeat copies, as well as the extensive differences in the spacers between repeats, is observed in CRISPR loci [2,12,15,23]. The finding that spacer sequences derive from foreign DNA, such as phage and transposons, suggests a defensive capacity for at least some instances of CRISPR system [12,13], but roles in replicon partitioning in the Archaea [1] and regulation of fruiting body development in M. xanthus [19,22] are also suggested. Correlation of repeat expression with CRISPR subtype is in order. Apern subtype repeats are expressed and processed in A. fulgidus and S. solfataricus [11]. Also expressed, in addition to their neighboring cas genes, are the Nmeni repeats of Streptococcus agalactiae (H. Tettelin and J. Dunning Hotopp, personal communication), the Mtube repeats of Staphylococcus epidermidis (S. Gill, personal communication), and the Hmari/Mtube/RAMP module region repeats of T. maritima (data not shown). Five separate markers from the Ecoli-type CRISPR array of G. sulfurreducens were up-regulated 2- to 3-fold when cells were grown with Fe(III) versus fumarate as electron acceptor [24]. We have characterized multiple distinct subtypes of CRISPR/cas loci and demonstrated profound differences in CRISPR system content between closely related strains and species. Beneficial roles may include defense of the host against foreign DNA [12,13] and regulation of the fruiting body development cycle by the DevR and DevS cas genes in the special case of M. xanthus [19,22]. These findings support an emerging model of CRISPR/cas systems. They appear to be portable adaptation modules for their host genomes. They are sufficiently unstable that degenerate forms are often seen and sufficiently mobile that multiple instances of LGT are apparent. Their repeat arrays consistently are among the most rapidly evolving loci seen in strain comparison studies, such that they are the basis of “spoligotyping” [23,25,26]. Both cas gene and repeat expression can be differentially regulated. They can be co-opted by their hosts for new regulatory systems, as seen for a pathway unique to Myxococcus in the interaction between the non-Cas protein FruA and Cas protein DevR. The adaptations they enable may be supplanted later by the evolution of more stable regulatory systems, but in the meantime they may be superbly useful in rapid adaptation, such as in the invasion of a new biological niche. Materials and Methods Identification of CRISPRs. CRISPRs were identified by three methods. Arrays of exact or near-exact repeats were readily detected by REPfind, a part of the REPuter package [27–29]. Sample sequences from known arrays were used to identify additional, smaller repeat clusters by BLASTN (http://blast.wustl.edu). Finally, regions suspected to have few and/or poorly conserved (degenerate) repeats, including regions near otherwise unexplained cas gene clusters, were examined manually with the dot-matrix homology visualization tool dotter [30]. Definition of CRISPR-associated (Cas) protein families. Cas protein families were identified from the construction of specific HMMs and subsequently deposited in the TIGRFAMs database (http://www.tigr.org/TIGRFAMs) [31]. The construction of HMMs involves refining multiple sequence alignments (also known as seed alignments), building HMMs from these alignments, exhaustively searching protein sequence databases, and selecting cutoff scores above which are found only true positives and below which no false negatives are detected. HMMs for Cas1–Cas4 were recognized among families previously designated “conserved hypothetical protein,” or were constructed for the first time, according to descriptions of these families by Jansen and colleagues [3]. All proteins encoded between or near identifiable cas genes were searched against a series of in-house databases of all available protein sequences and of prokaryotic genome sequences. Those that showed a pattern of matching numbers of similarly positioned proteins were investigated further as candidate new Cas protein families. For many of these families the process is iterative. Significant matches to the current model for the emerging family are found only near CRISPRs (see below) and/or previously identified cas genes. These new sequences are added to the family and realigned, and the revised HMM may then identify additional sequences. More distantly related sequences were distinguished from spurious matches by their contiguity to cas and CRISPR loci in other genomes, by the quality of the revised multiple sequence alignments, and by the improved search sensitivity of the HMM that resulted from adding these sequences to the seed alignment. Completed models were classified as new Cas protein families only if they did not overlap with other Cas HMMs, did not identify a sequence in a species that lacks CRISPRs, and if members of the family were found in at least four different species from at least three different lineages. Furthermore, members of these models had to be encoded adjacent or near to a set of CRISPRs and other cas genes. Iteration was halted rather than allow separate families to coalesce into one if the separate families showed substantially different domain architecture with only local sequence similarity, or if the separate families described separate genes recurringly found near one another in different genomes. The construction of Cas protein HMMs continued in this way until the boundaries of the loci were reached, where additional genes had identifiable non-CRISPR-associated functions, and/or their homologs in other genomes were no longer CRISPR-associated. All designated Cas family HMMs were searched routinely against comprehensive protein databases, which include eukaryotic sequences and CRISPR-negative prokaryotic genomes, to reconfirm specific association with repeats. CRISPR genome properties. The presence or absence of CRISPR/Cas systems (with both repeats and sets of Cas proteins that include Cas1) in general and of eight different CRISPR/Cas subtypes (Ypest, Ecoli, Nmeni, Apern, Dvulg, Tneap, Hmari, Mtube, and the RAMP module) are determined by evidence-based rules implemented in the Genome Properties system [32]. Genome Properties is a database system (http://www.tigr.org/Genome_Properties) that can collect both manually curated and automated rule-based assertions of the presence or absence of complex biological systems and their components. States of YES and NO were imported from the work of Jansen and colleagues [3] and corrected in one case (T. acidophilum to YES). The YES state is set for subsequent prokaryotic genomes if Cas1 is detected; repeats are examined subsequent to assignment of the state. The state “none found” is converted to “NO” only if repeats prove absent. Rules for the individual CRISPR/Cas subtypes are based on protein family assignments made by the sets of subtype-specific HMMs listed in Table 1 and on proximity to the specified core Cas proteins for each type listed in Table 3.
                Bookmark

                Author and article information

                Journal
                BMC Genomics
                BMC Genomics
                BMC Genomics
                BioMed Central
                1471-2164
                2013
                22 January 2013
                : 14
                : 47
                Affiliations
                [1 ]Institute of Medical Microbiology, German Centre for Infection Research, Justus-Liebig-University, D-35392, Giessen, Germany
                [2 ]Department of Genomic and Applied Microbiology and Goettingen Genomics Laboratory, Institute of Microbiology and Genetics, Georg-August University Goettingen, Grisebachstrasse 8, D-37077, Goettingen, Germany
                [3 ]Bioinformatics Resource Facility, Center for Biotechnology, Bielefeld University, D-33549, Bielefeld, Germany
                [4 ]ICAR Research Complex for Goa, Ela, Old Goa, 403402, India
                Article
                1471-2164-14-47
                10.1186/1471-2164-14-47
                3556495
                23339658
                94123e14-bb43-4e90-b121-8860d16379f7
                Copyright ©2013 Kuenne et al.; licensee BioMed Central Ltd.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

                History
                : 7 September 2012
                : 15 December 2012
                Categories
                Research Article

                Genetics
                Genetics

                Comments

                Comment on this article