88
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      In Vivo mRNA Profiling of Uropathogenic Escherichia coli from Diverse Phylogroups Reveals Common and Group-Specific Gene Expression Profiles

      Read this article at

          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          INTRODUCTION To successfully thrive in the host environment during the course of an infection, pathogens have to rapidly adapt to the specific conditions encountered. Thereby, a key to understanding microbial pathogenesis lies in knowledge of which genes are expressed to initiate and maintain the infection and of the global impact of the host environment on the transcriptional profile of the pathogen (1). Urinary tract infections (UTI) are one of the most common bacterial infections worldwide, and most of them (over 80%) are caused by uropathogenic Escherichia coli (UPEC) (2). It is widely accepted that UPEC strains originate from the distal gut microbiota where they mostly behave as commensals (3), although UPEC strains are armed with extra virulence genes (4). Those virulence genes are often present on strain-specific pathogenicity islands (PAIs), which are clusters of virulence-related genes (5 – 7). PAIs are diverse in content and genome location and, as more sequence information of more examples of the islands accumulates, greater insights into their role in disease can be expected (8, 9). UTI is recognized as presence of the bacteria in urine (bacteriuria). During the course of infection, bacterial cells are attaching to human epithelial cells, utilizing chaperone usher (CU) fimbriae that contain adhesins on their tips (10). The prototypical CU type I fimbriae adhesion can lead to intracellular invasion of bladder epithelial cells (11). UPEC strains are known to enter the cytoplasm and form biofilm-like structures called intracellular bacterial communities (IBC) (12). After maturation of IBC, the UPEC cells can disperse into urine, or as part of the host response the infected epithelial cells may be exfoliated and released into urine. Exfoliated cells are replaced with transition epithelial cells, which may be as well invaded by UPEC, where it forms quiescent intracellular reservoirs (QIR) characterized by their persistence and antibiotic resistance (13). In vitro studies and various animal models have been valuable for exploring UPEC pathogenesis (14, 15) and have led to significant advances in understanding key pathogenicity mechanisms (16 – 22). Knowledge of UPEC gene expression during naturally occurring UTI will further add to the full understanding of microbial pathogenesis of this widespread bacterial pathogen. Indeed, investigation of complex transcriptional adaptation processes of UPECs to the human host is expected to uncover key regulatory components and to provide unique insight into bacterial pathogenicity (23). Furthermore, the identification of E. coli virulence genes associated with UTIs is potentially valuable in differentiating UPEC from nonuropathogenic E. coli and might lead to the introduction of virulence typing strategies into clinical microbiology. Today, the significant advances in next-generation sequencing technologies enable unbiased and very accurate quantitative annotation-independent detection of transcripts at high resolution (24). Furthermore, RNA sequencing (RNA-seq) can be used to extract genotype information from the cDNA on a single-nucleotide resolution level, providing profound insights into phylogenic relatedness. Although RNA sequencing studies have been widely used for quantitative and qualitative transcriptional profiling of various bacterial pathogens (24 – 30), the application of RNA-seq to determine global transcriptional profiles during the infection of the human host has remained very limited. In this study, we used strand-specific RNA-Seq to generate comprehensive in vivo transcriptional profiles of 21 UPEC stains causing symptomatic UTI in a cohort of elderly patients and gained profound insights into the conservation/variation of transcription patterns across UPEC isolates that exhibited a broad phylogenetic distribution. While most known UPEC virulence factors could be identified, comparison of the in vivo transcriptional profiles uncovered a set of genes that is specifically transcribed during the course of an infection and which cannot be inferred from analyzing genomes or from transcriptional profiles of UPEC isolates recorded under laboratory culture conditions. RESULTS Broad phylogenetic distribution of E. coli UTI isolates isolated from elderly patients. With the aim to record in vivo transcriptional profiles of UPEC stains, urine samples were collected from outpatients with symptomatic UTI prior to antibiotic treatment. Overall, 21 urine samples were included in this study. All of them were culture positive on MacConkey agar plates, with more than 106 E. coli CFU/ml urine in pure cultures, and microscopic inspection of urine sediments revealed the presence of massive numbers of neutrophils (>100/µl). The 21 patients were mainly elderly (mean age above 60 years, with only 4 patients being younger than 60 years), 8 were male, and 13 were female. RNA isolation procedures and strand-specific Illumina-based RNA sequencing of bacterial mRNA were performed, and the raw sequence output after the removal of reads that mapped to the human genome consisted of 61.01 million reads. Thus, on average, 2.9 million reads were retrieved from each of the 21 samples. In accordance with the finding that the gene content between pairs of E. coli genomes may diverge by more than 30%, the range of gene numbers to which those reads mapped was between 3,848 and 4,972. In E. coli, <3% of nucleotide divergence is found among conserved genes in the various genomes (6). This high degree of homogeneity allows the establishment of phylograms that are built upon sequence variations. Previous studies have identified five major phylogenetic groups, (B2, B1, D, A, and E), corresponding to E. coli strains with distinct capability to cause disease and to inhabit various ecological niches (31 – 36). Figure 1 depicts the phylogenetic distribution of previously sequenced E. coli isolates that have been grouped into the five phylogenetic E. coli groups. This tree is based on sequence variations of 336 genes (for those genes, at least 80% sequencing coverage across the 21 UTI isolates was detected), which allowed us to use the genotype information from the RNA-seq data of the E. coli genomes to assign the 21 UTI isolates of this study to the clusters within the phylogenetic tree (Fig. 1). Reflecting the fact that our study group consisted mostly of elderly patients, we found a broad distribution of the 21 UTI-associated isolates between the phylogenetic groups. A total of 43% of the 21 isolates belong to the virulent E. coli strain phylogroups B2 and D (B2, 33%; D, 10%), whereas the others are distributed in the B1 (38%) and A (19%) phylogroups. FIG 1  Phylogenetic tree of 54 previously sequenced strains and the 21 clinical isolates from this (in italic) work based on sequence variation within 336 genes. Phylogenetic groups are indicated based on previous reports (34, 35). The numbers show the bootstrapping values as provided by RaxML. Commonly transcribed genes of the E. coli UTI isolates exhibit a conserved expression profile. With the aim to uncover the full extent of the in vivo gene expression profile of the 21 clinical E. coli isolates, we mapped all obtained Illumina sequencing reads to a list of 12,331 nonredundant E. coli genes. This list of genes was generated by the comparative genomic analysis of 54 previously fully sequenced E. coli genomes (see Materials and Methods). The entire list, including ortholog identifiers (IDs) as well as the expression values of the 21 UTI samples, is provided in Data Set S1 in the supplemental material. This list includes 2,129 genes shared by all 54 strains and 10,202 genes that are absent in at least one of the 54 strains. Among the latter, 3,257 genes were found in only one of the 54 published genomes as singletons. Only very few genes having homologs in all 54 sequenced E. coli isolates were not transcribed in any of the 21 isolates under in vivo conditions, indicating that expression of most of the core genome is relevant for bacterial replication in the human urinary tract. Furthermore, we found a large set of overall 2,589 genes that were commonly transcribed in all isolates during in vivo conditions, which—depending on the genome size of the isolates—accounts for 52% to 67% of all transcribed genes within one isolate. As depicted in Fig. S1 in the supplemental material, those commonly expressed 2,589 genes appear to be unregulated or constitutively expressed, as the overall variation of the expression profiles among the isolates was low and the genes were expressed at a generally high level independently of their phylogenetic group specificity. As expected, many of these genes correspond to genes required for the maintenance of basic cellular functions, such as DNA repair, ATP synthesis, aminosugar metabolism, and protein transport (see Table S1 in the supplemental material). Since we found only a low variation in the expression levels of the genes commonly transcribed in all 21 E. coli isolates at the time of mRNA sampling, hierarchical clustering based on their transcriptional profiles did not reveal specific and distinct clusters. We also performed matrix-assisted laser desorption ionization–time of flight (MALDI-TOF) mass spectrometry biotyping to elucidate whether protein fingerprints might uncover clusters that serve for the identification of phylogenetic relatedness. MALDI-TOF mass spectrometry (see Fig. S2) correctly classified our UTI E. coli isolates on the species level. However, a dendrogram based on Minkowski distances and group averages did not reveal distinct subgroups within our isolates that would correlate to the previously identified phylogenetic groups B2, B1, A, and D. This may reflect the fact that MALDI-TOF mass spectrometry covers mostly housekeeping proteins, e.g., the ribosomal proteins, and therefore is ill suited to discriminate phylogenetic relationships. The in vivo gene expression profile of the E. coli UTI isolates correlates with phylogenetic group clustering. Mapping of all obtained Illumina sequencing reads to the list of 12,331 nonredundant E. coli genes revealed—apart from the 2,589 commonly transcribed genes (see above)—a large fraction of genes (6,305 genes) that were expressed in at least one of the 21 UTI strains (see Data Set S1). Remarkably, clustering of the in vivo transcripts based on principal component analysis (PCA) of the 21 UTI isolates (Fig. 2), including commonly transcribed genes as well as those of the flexible genome, compared very well to that of phylogenetic clustering based on the single nucleotide polymorphism (SNP) profile (Fig. 1). The expression profile of the 21 UTI samples clustered into three main groups that represented the B2, D, and A/B1 phylogenetic groups. Of note, clustering became even more accurate and well separated when only the expression of genes of the flexible genomes was included in the analysis (data not shown). These results are in agreement with previous reports (37, 38) and clearly demonstrate that the presence of group-specific gene repertoires, and not a difference in overall gene expression profiles, impacts on clustering of the UTI isolates into the phylogroups. FIG 2  Clustering of the in vivo transcripts of the 21 UTI isolates based on principal component analysis (PCA). Clustering clearly reflects phylogenetic relatedness as the clinical isolates grouped according to their affiliation to the B2, D, and A/B1 phylogroups. We also performed a de novo assembly of reads from the 21 isolates that did not match any of the 54 sequenced genomes, which resulted in the identification of 158 potential genes, 48 of which are organized in operon structures. A total of 105 of the genes have homologs in E. coli, and 53 have homologs in other Enterobacteriaceae (see Table S2). In vivo mRNA expression profiling of known UPEC virulence factors. Many of the genes found to be expressed in vivo in the 21 UTI isolates included known key E. coli virulence factors. Although we sampled voided bacteria, which are clearly distinct from attached and biofilm-grown bacteria, genes responsible for adhesion to the uroepithelium, e.g., type I fimbriae (fim) (16), P fimbriae (pap), F1C/S fimbriae (foc and sfa), were found (Table 1). However, we did not find a uniform expression of any of those common adhesion-related genes. Whereas no or only very low expression of fimA, whose expression has been demonstrated to enhance E. coli virulence in the urinary tract (16), could be detected in 13 UPEC isolates in this study, the fimA gene and the subsequent operon was highly expressed in 8 isolates. Additionally, P fimbriae and F1C/S fimbriae-encoding genes were expressed in a subset of isolates (5 and 3 isolates, respectively). Interestingly, F1C/S fimbriae gene expression was exclusively found in isolates which grouped to the phylogenetic B2 cluster. TABLE 1  UPEC virulence genes present in the 21 clinical isolates Result by phylogenetic group and UTI isolate no. c D A B1 B2 24 8 10 27 26 21 23 15 1 5 U3 U5 4 17 11 3 9 19 14 25 2 Adhesion      fim ++ ++ ++ ++ − − + + + + ++ ++ ++ + ++ + + + + + +      pap + − − + − − − − − − − − − − + ++ − ++ − − −      F1C/S − − − − − − − − − − − − − − + − ++ + − − − Iron acquisition      ent ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++      fep ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++ ++      iuc + − − − ++ ++ ++ ++ + ++ ++ − ++ − ++ ++ − ++ ++ ++ −      iro − − + − − ++ ++ ++ ++ ++ ++ − − − ++ − ++ ++ − − −      irp ++ − − − − ++ + − − ++ ++ − − ++ ++ + ++ ++ + ++ ++      chu ++ ++ + + − ++ − − − − − − − − ++ ++ ++ ++ ++ ++ + Capsule      kps ++ ++ − ++ − + − − − − − − − + ++ ++ ++ ++ ++ ++ ++ Toxins      cnf1 − − − − − − − − − − − − − − ++ − ++ − − − −      hlyA − − − − − − − − − − − − − − ++ − ++ ++ − − −      picU − − − − − − − − − − − − ++ − − − ++ ++ + ++ −      sat − − − − − − − − − − − − − − − ++ − ++ ++ ++ −      vat − − + − − ++ − − − − ++ − − − ++ ++ ++ ++ − − ++      usp − − − − − − − − − − − − − − ++ ++ ++ ++ ++ ++ ++      pks a − − − − − − − − − − − − − − ++ − ++ ++ − − −      colV b − − − − − ++ − ++ − ++ ++ − − − ++ − − − − − − a pks island genes c2471 to c2451. b Colicin V transport genes cvaAB deduced from de novo analysis (see Table S2 in the supplemental material). c The presence or absence of the gene/operon is indicated as follows: −, no reads detected (nRPK value from 0 to 1.5); +, reads detected with low values (nRPK from 1.5 to 2.0) or partial operons were detected; ++, genes with nRPK values of >2. Genes encoding iron acquisition systems were found to be widely expressed in vivo. The enterobactin and its transport system-encoding genes (ent and fep) were expressed in all UTI isolates without exception, whereas expression of aerobactin (iuc), yersiniabactin (irp), and salmochelin (iro) genes was less uniform. Expression of the heme-mediated iron acquisition system (chu) was present in 100% of isolates clustering with the D and B2 phylogenetic groups and in 75% of the isolates clustering with group A but absent in those clustering with group B1. Capsular polysaccharide expression was observed in 100% of isolates clustering with group D and B2, in 50% clustering with group A, and only partially in one isolate of group B1. The expression of extracellular toxin-encoding genes in vivo was less frequent. Overall extracellular toxin expression was most frequent in those UTI isolates that clustered with strains from the B2 phylogenetic group. cnf1 and hlyA expression was observed in only two and three isolates, respectively. Expression of genes encoding serine protease autotransporter PicU was present in five isolates, whereas the gene encoding the serine protease autotransporter Sat was expressed in four isolates, all of them clustering with the B2 phylogenetic group. The vacuolating autotransporter toxin-encoding gene vat was expressed in 8 isolates, and ups gene expression was found exclusively in isolates that clustered with group B2 isolates. The genes encoding the transport system of colicin V were detected in 5 isolates, and the clb operon encoded on the pks island (39) was observed to be expressed in 3 isolates, again clustering with strains of the B2 group. In vivo expression profiling of small regulatory RNAs. RNA-seq profiling enabled us to investigate expression of small regulatory RNAs (sRNAs), which have been assigned central roles in virulence and environmental fitness (40, 41). Eleven sRNAs were identified that exhibited high in vivo expression levels in all or most of the 21 clinical UTI isolates (Fig. 3). FIG 3  Expression profile of the small regulatory RNAs in the 21 UTI isolates and the 4 UTI isolates cultivated in vitro. The genes (vertical) are hierarchically clustered using Pearson distances, and the samples (horizontal) are clustered according to Spearman rank correlation. The histogram describes the correlation of the color to the nRPK value of absolute expression. Among them, ryiA (glmZ) and csrB exhibited the highest in vivo expression levels. The sRNA RyiA (GlmZ) activates glmS expression. GlmS synthesizes glucosamine-6-phosphate (GlcN-6-P) and thus delivers precursor molecules for the biosynthesis of peptidoglycan and lipopolysaccharides (LPS), which are essential elements of the Gram-negative bacterial cell wall (42, 43). Another sRNA that was found to be highly expressed during UTI was csrB. Both sRNAs csrB and csrC are modulatory components of the carbon storage regulatory (Csr) network. They contain multiple CsrA binding sites, which permit them to sequester and antagonize CsrA, a pleiotropic regulator of carbon metabolism (44 – 46). Transcription of these two small RNAs is regulated by the BarA/UvrY two-component signal transduction system (TCS) in E. coli or by homologous systems such as GacS/GacA in other bacteria (47). The Csr system (or the homolog RsmA/RsmZ) is present in many eubacteria and is known to be involved in mediating adaptive physiology, timed virulence trait expression in animal pathogens (48, 49), and biofilm formation (50, 51). Recently, it was shown to interact with the stringent response regulatory system (52). Although other sRNA were also identified to be expressed during in vivo growth, their overall expression levels were often lower than those observed in four representatives of our clinical isolates that were cultivated in vitro under rich medium conditions until late exponential growth phase. Apparently, a large number of those highly in vitro-expressed sRNAs serve the adaptation to stationary phase of growth (Fig. 3). Among those, we found micA, a negative regulator of ompA (53), ryhA (arcZ), and rprA, encoding sRNAs that increase the translation of the stationary sigma factor RpoS (54, 55). Their expression, as well as the expression of sroC and ryeB, has been associated with stationary phase of growth (56, 57). Additionally, the products of rprA, isrA (mcsA), and omrA were strongly expressed. Those sRNAs have been shown to negatively regulate the translation of CsgD, the major transcriptional regulator of E. coli curli biosynthesis (58, 59). Identification of infection-relevant gene expression profiles in the E. coli UTI isolates. In addition to known virulence factors, we aimed at identifying infection-relevant genes that are commonly expressed in UPEC isolates in vivo. We therefore cultivated 4 of the UTI isolates (isolates UTIU3 and UTIU5 clustering with phylogroup B1, UTI24 clustering with group D, and UTI9 clustering with group B2) in vitro under rich medium conditions and recorded the transcriptional profiles. A total of 202 genes were found to be upregulated under in vivo conditions in the 4 strains, and all of those genes have been demonstrated to be expressed in all 21 UTI isolates in this study under in vivo conditions. A detailed list of the 202 commonly and exclusively in vivo expressed genes is provided in Table S3 in the supplemental material (detailed data on all differentially expressed genes is given in Data Set S2A [upregulated genes] and S2B [downregulated genes] in the supplemental material). Whereas only 23 hypothetical or conserved hypothetical genes were found, use of the systematic functional annotation provided by Gene Ontology revealed that 20% of the genes belonged to functional groups involved in general biological processes such as ATP synthesis and catabolic processes as well as transcription, translation, and DNA replication and repair. Furthermore, many genes were found to be involved in rRNA and tRNA processing, indicating that the bacteria rapidly grow in the human urinary tract. Consistent with the fact that main carbon sources for E. coli during UTI are peptides and amino acids, genes that belong to biological processes of proteolysis, protein transporters, carbohydrate metabolism, and fatty acid biosynthesis were represented. We also found that genes encoding enzymes of the pyruvate dehydrogenase complex were highly expressed. Another large group of genes commonly expressed in vivo were genes involved in the regulation of bacterial cell shape and in bacterial stress responses, such as responses to toxic substances, including antibiotics. Furthermore, we found an in vivo overexpression of ampG encoding a peptidoglycan permease, which was shown to be involved in evasion of the host innate immune system during UTI (60). We also found gidA encoding the tRNA uridine 5-carboxymethylaminomethyl modification enzyme (61) among the commonly in vivo expressed genes. gidA is known to impact on the posttranscriptional level on a number of virulence factors in Pseudomonas syringae (62), Aeromonas hydrophila (63), Shigella flexneri (64), Streptococcus suis (65), Streptococcus pyogenes (66), Salmonella enterica serovar Typhimurium (67), and E. coli (68). Of note, the gidA gene was also shown to be upregulated in the majority of patients’ samples from a previous study on UPEC transcriptomics (69) and in an earlier murine UTI in vivo gene expression study (14). Overall, our transcriptional data are remarkably consistent with those previous reports (14, 69) and with the transcriptional profile of E. coli isolated from patients with asymptomatic bacteriuria (ABU) (70). Expression of genes involved in nitrate/nitrite metabolism and nitric oxide (NO) protection, upregulation of iron acquisition systems, and genes involved in carbohydrate and amino acid metabolism were commonly observed (14, 69, 70), reflecting bacterial adaptation to the growth conditions encountered in the environment of the urinary tract. Interestingly, for 3 (carA, carB, and argC) out of the 202 commonly highly expressed in vivo genes in our study, it was shown that their inactivation poses a competitive disadvantage to the respective mutants in the mouse urinary tract (71). These results clearly suggest that their expression is crucial for growth in the urinary tract. Identification of genes that are exclusively expressed in the E. coli B1 and A phylogenetic groups. To evaluate whether the UTI-associated isolates that group with B1 and A express a distinct set of genes potentially relevant for the infection process, we extracted from the list of genes that were found to be differentially regulated among the 21 UPEC isolates those that were specifically expressed in the 12 phylogroup A/B1 isolates. We identified 142 genes that were expressed at a significantly higher level in the 12 phylogroup A/B1 isolates (see Fig. S3 and Table S4A in the supplemental material), compared to all other 9 isolates. Interestingly, 27 (19%) of these genes were associated with utilization of alternative carbon sources, with in particular the complete set of the 12 genes required for phenylalanine degradation into succinyl coenzyme A (CoA) (tynA, feaB, paaKEACBGZJFH), indicating that those isolates have access to sufficient amounts of phenylalanine in the urine. Of note, mutations in aroA have been used to construct attenuated strains of various Gram-negative bacteria, including E. coli (72). Thereby, the attenuation is due to the inability of aroA mutants to synthetize chorismate, which is a precursor of important biochemical intermediates such as indole and aromatic amino acids, many alkaloids, and other aromatic metabolites, as well as folate and 2,3-dihydroxybenzoic acid used for enterobactin biosynthesis. The availability of aromatic amino acids in the urine may not only enable E. coli growth on 2-phenylalanine but also may save chorismate for iron chelator biosynthesis as a crucial virulence trait. Interestingly, among the in vivo expressed genes that were found to be enriched in the 12 phylogroup A/B1 isolates, we also found iroC and iroD involved in transport and procession of the siderophore salmochelin. iroC was also upregulated in vivo compared to LB cultures in one of two isolates, clustering with group B1, for which an in vitro transcriptional profile was recorded. These results indicate that the siderophore may play an important role in iron acquisition within the subgroup of UPEC isolates that cluster in the A/B1 phylogroup and that lack the common UPEC-associated virulence gene expression. We could also detect 13 (9%) genes encoding fimbrial adhesins mostly described as functional but cryptic, including the ycb operon (ycbRSTUVF), part of the yra operon (yraH, yraJ, and yraK) (73, 74), and genes encoding a CS1-type fimbrial structure (10CE_3624, -25, -26, and -27) that is usually associated with enterotoxigenic E. coli (75). The enrichment in fimbriae genes in the group of the 12 studied A/B1 isolates could reflect a characteristic increased adhesion capability (76). We also observed expression of Rhs element genes. Many bacteria contain all or part of 5 Rhs elements: RhsA, -B, -C, -D, and -E, scattered around the chromosome. Each Rhs region contains a 3.7-kb GC-rich DNA sequence that is 99% identical from one element to another. These high-identity levels between Rhs proteins was proposed to mediate major intraspecies chromosomal rearrangements, hence their name (which stands for “recombination hot spot”) (77). However, high conservation of intact rhs main genes (rhsA, -B, -C, -D, -E) also suggested that they could contribute to a function subjected to selective pressure (78). Intriguingly, rhs genes are not expressed to a detectable extent during routine cultivation, and the conditions leading to Rhs expression have not yet been elucidated (79). Our mRNA expression analysis demonstrates that some Rhs elements are specifically expressed in vivo in all 12 UPEC isolates belonging to the B1/A phylogroup. Of note, recent studies suggested that expression of Rhs elements are associated with bacterium-host or bacterium-bacterium interactions, suggesting that such functions could contribute to UTI (80, 81). Their expression has furthermore been associated with toxin-antitoxin (TA) activity and to be potentially delivered through a type 6 secretion apparatus delivering effectors both in prokaryotic and eukaryotic prey cells (82, 83). Some TA systems have recently been shown to be important for colonization of the bladder (yefM-yoeB and tomB-hha) and survival within the kidneys (pasTI, previously named yfjGF) in a murine UTI model (84). In this study, the chpAR, yafQ-dinJ, and hicBA TA systems were found to be highly expressed in vivo, specifically in the isolates clustering with phylogroups A/B1. Identification of genes that are exclusively expressed in the E. coli from the B2 phylogenetic group. Besides identification of genes specifically expressed in the isolates clustering with the B1 and A phylogroups, we also identified 389 genes that were specifically expressed in strains clustering with the B2 phylogroup. A total of 208 out of the 389 genes encode hypothetical or conserved hypothetical proteins, and 102 are annotated as encoding putative proteins (see Table S4B in the supplemental material). Apart from the well-described virulence genes, such as sat, encoding the secreted autotransporter toxin (85), or usp, encoding the uropathogenic-specific protein (86), as well as yadC, yadN, and yfcPQU, encoding putative fimbria-like proteins (73, 87), we found a large number of genes encoding transporter and secretion systems. We found genes encoding components of type II general secretion pathways yheBDK and hofDFGHIK, also annotated as gsp genes in the gspC-O operon involved in secretion of endochitinase yheB (chiA) (88). Secreted chitinase is increasingly recognized as a virulence factor of pathogenic bacteria infecting mammal host (89). We also found that components of the hypothetical type VI secretion pathway (encoded by APECO1_3694, 3695, 3696, 3698, 3702, 3705, 3711, and 3712, E. coli APEC O1:K1:K7 gene IDs) were expressed in vivo. Furthermore, a large group of genes encoding various transport systems, like yjcTU encoding a d-allose ABC transporter and a putative iron compound ABC transporter encoded by APECO1_3384 to APECO1_3389, as well as a B2 phylogroup-specific expression of phosphotransferase systems (PTS) responsible for transport of sugars into the bacterial cell, were identified. In contrast to the isolates clustering with the A/B1 phylogroups that exhibited extensive upregulation of the phenylalanine degradation pathway, isolates clustering with the B2 group seem to use various sugars as main carbon and energy sources. DISCUSSION A key to understand microbial pathogenesis is to unravel how the host environment impacts on the global gene expression pattern of a pathogen and to identify the gene repertoire whose expression is essential for the initiation and maintenance of an infection. In this study, we applied massive parallel cDNA sequencing (RNA-seq) to provide unbiased, deep, and accurate insight into the nature and the dimension of the uropathogenic E. coli gene expression profile during an acute infection within the human host measured on bacteria present in voided urine. It is essential to indicate here that complex bacterial communities are present in the course of infection. In the current sampling procedure, we analyzed mainly planktonic bacteria, probably mixed with IBC from exfoliated epithelial bladder cells. It is possible that transcription profiling of selected adhesive cell population or IBC only would result in different gene expression results. With a total of 21 in vivo transcriptomes, this study includes a large number of bacterial strains studied in respect to pathogenic E. coli gene expression following naturally occurring symptomatic human UTI. We applied RNA-seq to detect global transcriptional profiles independent of genome annotations and analyzed the in vivo transcriptomes to their full extent, including flexible genomic elements and expression of small regulatory RNAs. Furthermore, we identified single nucleotide polymorphisms (SNPs) in the bacterial isolates and used their cumulative differences to provide a large number of discriminators. These discriminators represent typing markers to distinguish bacterial isolates and to group the 21 UPEC isolates to one of the four main phylogenetic groups, A, B1, B2, and D. Our findings on gene expression profiles in the urine of patients suffering from a UTI are generally consistent with data generated using murine models and a previous array-based transcriptome study of gene expression during a human UTI (14, 69, 70). When comparing the in vivo gene expression profiles to those recorded under laboratory medium conditions, we found that E. coli adapts to the conditions encountered within the human host by expressing genes required for rapid replication, acquisition of iron, attachment to the uroepithel, and evasion of the immune system, while variably expressing virulence genes. Analysis of sRNA expression revealed consistent expression of sRNA involved in cell wall biosynthesis and integration of membrane proteins (glmZY) (42, 43) and in mediating adaptive physiology and timed virulence (csrBC) (44 – 46, 48, 49), underpinning the role of sRNAs in bacterial adaptation processes. Although it is widely accepted that UPEC strains originate from the distal gut microbiota, they seem to be capable of colonizing the urinary tract and to cause symptomatic infections of cystitis and pyelonephritis, because they are armed with extra virulence genes that distinguish them from E. coli commensals (4). Several studies have demonstrated that the phylogroups differ in respect to the presence of virulence factors and ecological niches, and UPEC isolates have previously been found to be more prevalent in group B2. In line with this, we found 7 UPEC isolates that grouped with the B2 phylogenic group, and they expressed several virulence genes in vivo that have been associated with UPEC strains exhibiting full-pathogenic potential. Nevertheless, and in accordance with previous studies on atypical UTI patient populations (90 – 92), in our study, which was performed on samples collected mainly from elderly patients, as many of 12 out of the 21 UPEC isolates analyzed were assigned to the A and B1 phylogenetic groups, which predominate among commensal E. coli. We found that E. coli isolates that have been assigned to the four phylogroups share a large general gene expression profile, overall 2,589 genes were commonly transcribed in all isolates during the in vivo conditions, which—depending on the genome size of the isolates—accounts for 52% to 67% of the transcribed genome of the individual isolates. This conservation of a large part of the genome expression might account also for the finding that MALDI-TOF mass spectrometry, which probably corresponds to more- or less-conserved housekeeping proteins, does not allow a robust discrimination into the previously identified phylogenetic groups B2B1, A, and D. Although the 21 isolates share a large general gene expression profile, they do express clearly distinct flexible genomes. We found a strong correlation between the E. coli in vivo expression of the flexible genome and the genetic background of the isolate. However, as has been described before (37, 38), this correlation was dependent on the acquisition of group-specific gene repertoires in the flexible genomes rather than on a difference in their expression profile, possibly reflecting their evolution in distinct niches. Not only did our study identify previously described virulence-associated genes that were exclusively expressed in the 7 UPEC isolates clustering with group B2, but we also identified a novel set of genes overrepresented in those isolates. Among those, we found a large number of genes encoding transporter and secretion systems, indicating that they play a role in pathogenicity of B2 group isolates. Furthermore, we identified a set of 142 genes whose expression was demonstrated to be specifically enriched in the 12 isolates that clustered with the A/B1 phylogroups, including genes encoding phenylalanine degradation pathway, a siderophore, fimbrial adhesins, and Rhs elements. As more examples of in vivo transcriptional profiles accumulate, greater insights into the role of new genes involved in microbial pathogenicity can be expected. However, further investigations are required to unravel the specific impact of novel virulence-determining factors in the establishment and maintenance of the disease. Thereby, the application of in vivo RNA-seq seems to be particularly appropriate, as it affords detailed quantitative and qualitative sequence information that is independent of genome annotations and thus allows the establishment of full transcriptional profiles, including flexible genomic elements and expression of small regulatory RNAs. Furthermore, knowledge of SNPs as identified by the use of RNA-seq enables highly resolving phylogenetic grouping of clinical isolates and thus provides a basis for further global phenotypic-genotypic correlation studies. MATERIALS AND METHODS Ethical statement. Urine samples were collected from 21 outpatients with symptomatic urinary tract infections and subjected to bacterial RNA extraction procedures. Samples were collected according to the standards of the Declaration of Helsinki. The sample provided for this research was subtracted from the samples collected for routine microbiological tests, which are made on a regular basis; therefore, no additional procedures were carried out on the patients. Samples were analyzed upon informed consent from the patients. Bacterial RNA extraction and Illumina-based RNA sequencing. Urine samples (approximately 20 ml) were mixed with RNAprotect reagent (Qiagen), incubated for 15 to 30 min at room temperature, and centrifuged for 15 min at 4,000 × g at 4°C, and the pellet was frozen at −70°C. RNA isolation was performed using the RNeasy minikit (Qiagen) according to the manufacturer’s instruction with some modifications, and the DNA was removed by the use of a DNA-free kit (Ambion). Enrichment for bacterial RNA was achieved by using the MicrobEnrich kit (Ambion) according to the manufacturer’s instructions. Four UTI-associated isolates were also cultured in vitro in LB medium. RNAprotect reagent (Qiagen) was added to 3 ml of LB culture following growth to late exponential phase. All of the samples were treated for bacterial RNA enrichment. After depletion of rRNA from the samples, total RNA was subjected to a commercial capture and depletion system (MICROBExpress bacterial RNA enrichment kit; Ambion), strand-specific bar-coded cDNA libraries were generated as described (51), and all samples were single-end sequenced on an Illumina GenomeAnalyzer-IIx at a 36-bp read length. Traces of human reads were removed from the raw sequence output by mapping all reads to the latest human genome release, GRCh37. Mapping was performed with Bowtie (93), allowing for maximal 2 mismatches per read. The sequence output after human read removal consisted of 61.01 million reads. E. coli reference sequences. We used the genomic sequences of 54 E. coli isolates that were available for download from GenBank/EMBL (September 2012) as a reference to map all Illumina reads obtained in this study. The 54 E. coli genomes contain 252,623 genes, which give an average of 4,678 genes per strain (more details are presented in Table S5 in the supplemental material). With the aim to collapse those genes into gene families and define the genes present in all genomes, we first extracted all coding sequences (CDS) from the corresponding genomes. We then blasted the protein sequences found in all genomes against each other using BLASTP (94), discarding hits with <90% length and 50% sequence identity. Only if a gene product had a maximal reciprocal set of homologs in all other strains, 54 in total, the corresponding gene was considered “core”; otherwise, it was considered “flexible.” Flexible CDS that had homologs in 53 or 52 of the 54 E. coli genomes were reevaluated. The set of core genes detected in the reciprocal Blast search comprised 1,719 CDS, while there were an additional 363 CDS assigned to the core genome, summing to 2,082 core CDS (see Data Set S1 in the supplemental material). Apart from the 2,082 core CDS, we also identified 10,202 flexible CDS, including 3,257 singletons. Among the 54 completed E. coli genomes considered here, O26:H11/AP010953, O111:H/AP010960, and O103:H2/AP010958 exhibited the highest number of annotated small RNAs. We extracted the genomic sequences of 70 noncoding RNAs (ncRNAs) from the O26:H11 genome and performed BLASTN searches against each of the 54 genomes in order to define how many of these ncRNAs are present in all E. coli genomes. A total of 47 ncRNAs (45 small RNAs and 2 ncRNA, rne5, an RNase 5′ untranslated region [UTR] element, and Alpha_RBS, a ribosomal binding site of alpha operon) were found in all 54 genomes, and 41 of them were expressed at least in one of our clinical isolates. These ncRNAs were included in the core genome that consisted of 2,129 (2,082 CDS and 47 ncRNAs) genes. Finally, the sum of core (2,129) and flexible (10,202) genes amounted to 12,331 genes. The data representing the compiled 12,331-gene list, the orthologous gene IDs (with sequence length composition and percentage of identity), and the gene expression levels of each of 21 samples are presented as Data Sets S1A and S1B in the supplemental material. Mapping and gene expression profiling. The raw Illumina sequence reads (36-bp single end) were first split according to their bar codes using the fastq-mcf script of the ea-utils package (95), and then the bar code sequences were removed. We used the bowtie-build module in the Bowtie package (93) to build an indexed reference based on the 12,331 E. coli genes found in the 54 reference genomes as defined in the previous step. Mapping to the reference was performed using Bowtie with options “-m 1 -best -strata” to allow only uniquely mapping hits and avoid uncertainties regarding repeat regions and ribosomal genes. Finally, the read counts per gene (RPG) were recorded for each annotated gene and were used as an input for differential gene expression calculations with the R package DESeq (96). Briefly, the RPG data were normalized for variation in library size/sequencing depth by using the estimateSizeFactor function of DESeq. Differentially expressed genes were identified using the nbinomTest function based on the negative binomial model. Genes were considered to be differentially regulated only if their absolute logarithmic fold change over the control was higher than 1 at a false discovery rate of a maximum 5% (Benjamini and Hochberg P value correction provided in DESeq). In those clinical samples where no technical replica was sequenced, the uncorrected P values at 5% cutoff were used instead of the corrected ones. De novo assembly. All reads that did not map to the 12,331 E. coli genes were used as input for de novo transcriptome assembly with Velvet (97). We used a wide range of k-mers, 27 to 37, and a minimal transcript length of 100 bp. The assembled transcripts were blasted against all microbial genes downloaded from the MBGD Database (98) using a minimal hit length of 100 bp and sequence similarity higher than 90%. After removing the ribosomal gene hits, we identified 156 additional nonredundant genes. Phylogenetic tree. A consensus sequence for overall 336 genes (that had at least 80% sequencing coverage across the 21 UTI isolates) was generated by the use of the mpileup option in the SAMtools package (99). The corresponding orthologous gene sequences extracted from the 54 E. coli genomes were subsequently included. The sequence redundancies and gaps in sequence coverage were removed, resulting in a 2.3-Mb multi-Fasta file used for multiple alignment with Clustal Omega (100). The alignment was the subject of further refinement with RaxML (101), performing 500 bootstrapping steps and testing 50 trees. The consensus tree was drawn with Dendroscope (102). Gene ontology terms. We downloaded the current UniProt Gene Ontology (GO) knowledgebase (103). Using custom Perl scripts, we mapped the gene locus IDs (in KEGG format) to their UniProt identifiers and extracted the relevant GO IDs. The GO ID lists were summarized using the QuickGO browser (104). MALDI-TOF mass spectrometry biotyping. Intact cell smears of 19 E. coli isolates (for two patient samples, no bacterial cultures were preserved) were prepared in 10 biological replicates on MALDI target plates (MSP 96 polished steel target; Bruker Daltonics, Bremen, Germany) by following standard procedures. The air-dried smears were overlaid with 1 µl of saturated alpha-cyano-4-hydroxycinnamic acid matrix solution. E. coli DH5α bacterial test standard (Bruker Daltonics) was used for external calibration. Bacterial profile spectra were acquired in duplicates using a MicroflexLT MALDI-TOF device (Bruker Daltonics) for analysis in the mass range between 3 and 15,000 m/z with the Biotyper 3.1 software (Bruker Daltonics). In a quality-control step, spectra characterized by excessive noise and/or Biotyper scores indicating unreliable identification (<1.7) were excluded from our profile spectra library. We then generated reference spectra of each strain from the remaining 322 profile spectra using Biotyper MSP generation standard settings (105), yielding reference spectra for classification of our closely related E. coli strains. In a further quality-control step, we validated that our E. coli strains clustered together with the 11 E. coli strains among the more than 4,000 strains in the Biotyper database. The 19 strain reference spectra were clustered based on Minkowski distances and group averages. Nucleotide sequence accession number. The sequencing data have been submitted to SRA under the project accession no. SRP029244. SUPPLEMENTAL MATERIAL Data Set S1 (A) List of 12,331 non-redundant genes used in this report. The data set comprises the gene ID, the number of the orthologs, the gene name if applicable, the product name and the nRPK values for each gene in each of the 21 UTI isolates. (B) List of 12,331 non-redundant gene IDs and their respective ortholog IDs. Each ortholog entry indicates the gene ID, as well as the length and sequence identity. Download Data Set S1, XLS file, 0.1 MB Data Set S2 (A) List of the genes that were significantly up-regulated in at least one of the four UTI isolates during UTI when compared to growth in vitro. (B) List of the genes that were significantly down-regulated in at least one of the four UTI isolates during UTI when compared to growth in vitro. Download Data Set S2, XLS file, 0.1 MB Figure S1 Expression of the (2,589) commonly transcribed E. coli genes within the 21 clinical isolates. The genes (vertical) are hierarchically clustered using Pearson distances, and the isolates (horizontal) are clustered according to Spearman rank correlation. Download Figure S1, TIF file, 0.8 MB Figure S2 MALDI-TOF mass spectrometry biotyping. A dendrogram was calculated based on Minkowski distances and group averages. Download Figure S2, TIF file, 0.1 MB Figure S3 Expression of phylogenetic group A/B1-specific genes. Only those genes that were expressed in >70% of the A/B1 phylogroup-specific isolates and in not more than 30% of the isolates from other phylogroups are included. Download Figure S3, TIF file, 0.5 MB Table S1 List of 2,589 genes commonly expressed in all 21 UTI E. coli isolates. Table S1, PDF file, 0.3 MB. Table S2 De novo mapped genes. The list of genes generated by de novo assembly of the reads, which did not map to any of 54 E. coli genomes. The presence or absence of the gene is indicated as follows: −, no reads detected (nRPK value from 0 to 1.5); +, reads detected with low values (nRPK from 1.5 to 2.0); ++, genes with nRPK values of >2. Table S2, PDF file, 0.1 MB. Table S3 UTI-specific genes. The list of 202 genes, which were upregulated in all four isolates (UTIU3, UTIU5, UTI9, and UTI24) during UTI compared to in vitro conditions. Table S3, PDF file, 0.1 MB. Table S4 (A) Genes whose in vivo expression is specific for phylogenetic group A/B1 isolates; (B) genes whose in vivo expression is specific for phylogenetic group B2 isolates. Table S4, PDF file, 0.3 MB. Table S5 E. coli genomes used for bioinformatics analysis. Table S5, PDF file, 0.1 MB.

          Related collections

          Most cited references75

          • Record: found
          • Abstract: found
          • Article: not found

          Regulatory RNAs in bacteria.

          Bacteria possess numerous and diverse means of gene regulation using RNA molecules, including mRNA leaders that affect expression in cis, small RNAs that bind to proteins or base pair with target RNAs, and CRISPR RNAs that inhibit the uptake of foreign DNA. Although examples of RNA regulators have been known for decades in bacteria, we are only now coming to a full appreciation of their importance and prevalence. Here, we review the known mechanisms and roles of regulatory RNAs, highlight emerging themes, and discuss remaining questions.
            Bookmark
            • Record: found
            • Abstract: found
            • Article: not found

            Organised Genome Dynamics in the Escherichia coli Species Results in Highly Diverse Adaptive Paths

            Introduction Escherichia coli was brought into laboratories almost a century ago to become one of the most important model organisms and by far the best-studied prokaryote. Major findings in phage genetics, bacterial conjugation, recombination, genetic regulation and chromosome replication involved the use of E. coli, especially laboratory derivatives of the K-12 strain, originally isolated from the faeces of a convalescent diphtheria patient in Palo Alto in 1922 [1]. However, K-12 derivatives are far from representing the whole E. coli species [2]. The primary habitat of E. coli is the lower intestinal tract of humans and other vertebrates, with which it typically establishes commensal associations. Healthy humans typically carry more than a billion E. coli cells in their intestine. It has been estimated that half of the living E. coli cells are outside their host, in their secondary habitat [3]. Beside these habitats, certain strains have the potential to cause a wide spectrum of intestinal and extra-intestinal diseases such as urinary tract infection, septicaemia, meningitis, and pneumonia in humans and animals [4]. Furthermore, Shigella, which have been elevated to the genus order with four species (dysenteriae, flexneri, boydii, sonnei) based on their capacity to generate a specific mucosal invasive diarrhoea strictly in humans and their biochemical characteristics, in fact belong to the E. coli species [5]–[7]. Of note, Shigella and enteroinvasive E. coli are considered the only obligate pathogens of the species, whereas other strains are facultative pathogens with a broad host range. Thus, natural isolates of E. coli/Shigella live in conditions quite different from those in the laboratory and must cope with very diverse environments that provide stresses ranging from immune system attack and protozoal grazing to starvation, low temperatures, and, more recently, antibiotic therapy. With its large range of pathologies, E. coli is a major cause of human morbidity and mortality around the world. Each year E. coli causes more than two million deaths due to infant diarrhoea [8],[9] and extraintestinal infections (mainly septicaemia derived from urinary tract infection) [10], and is also responsible for approximately 150 million cases of uncomplicated cystitis [10]. Since humans and food animals carry so many E. coli cells that may establish commensal or antagonistic interactions with their hosts it is mandatory to define the genetic and population determinants that drive commensal strains to adopt a pathogenic behaviour. Population genetic studies based on both multi-locus enzyme electrophoresis [11]–[13] and various DNA markers [14]–[18] have identified four major phylogenetic groups (A, B1, D and B2) and a potential fifth group (E) among E. coli strains. Strains of these groups differ in their phenotypic characteristics, including the ability to use certain sugars, antibiotic resistance profiles and growth rate–temperature relationships [19]. The distribution (presence/absence) of a range of virulence factors thought to be involved in the ability of a strain to cause diverse diseases also varies among strains of these phylogenetic groups [20]–[22], indicating a role of the genetic background in the expression of virulence [23]. Consequently, these groups are differently associated with certain ecological niches, life-history characteristics and propensity to cause disease. For example, group B2 and D strains are less frequently isolated from the environment [24], but more frequently recovered from extra-intestinal body sites [23]. While B2 strains represent 30 to 50% of the strains isolated from the faeces of healthy humans living in industrialised countries, they account for less than 5% in French Guyana Amerindians [25]–[26]. The clear clustering of E. coli strains into monophyletically meaningful groups has long been used as an argument favouring clonality within the species. However, analysis of gene sequences shows pervasive recombination, matching the well-known efficiency of conjugation and transduction of the species [17],[27]. Hence, it remains controversial whether such frequent recombination obliterates the phylogenetic signal. E. coli genomes show evidence of widespread acquisition of functions by horizontal gene transfer, concomitant with similar amounts of gene deletion [28]–[29]. While less than 3% of nucleotide divergence is found among conserved genes, the gene content between pairs of E. coli genomes may diverge by more than 30% [30]. Such diversification of gene content due to horizontal gene transfer contributes greatly to the diversity of the strains' phenotypes and can be accurately quantified only by the sequencing of a large number of strains to completion and closure. Until now, sequencing efforts in E. coli have been focused mainly on pathogenic strains, particularly on diarrhoeal and group B2 extraintestinal pathogenic strains (see Table 1), precluding an unbiased assessment of the diversity of the species. Therefore, we have sequenced with high coverage and up to completion the genomes of 6 human-source E. coli strains. The E. coli strains were chosen to complement the available sequences and other ongoing sequencing projects (http://msc.jcvi.org/e_coli_and_shigella/index.shtml, http://www.sanger.ac.uk/Projects/Escherichia_Shigella/). They encompass two commensal strains of phylogenetic groups B1 and B2, a group B1 enteroaggregative strain, two group D urinary tract infection strains and a group B2 newborn meningitis strain (Table 1). We also sequenced the type strain of the closest E. coli relative, i.e., E. fergusonii [31], as an outgroup to permit accurate and meaningful evolutionary analyses with the 6 new E. coli genomes and the 14 other currently available E. coli/Shigella genomes. To statistically substantiate the identification of extraintestinal virulence-associated genes, we also applied a mouse lethality assay to the strains [32] to quantify the intrinsic virulence of the strain, excluding host variability and other potential confounding factors (Table 1). Our goal was to take the outstanding opportunity provided by the availability of many genomes of a single bacterial species, regarding which a considerable amount of knowledge has been accumulated over the years, to answer to the following questions. (i) Is there genome-wide evidence of frequent recombination and does it vary with genome location? (ii) If so, can one nonetheless infer an intra-specific bacterial phylogeny? (iii) How do the different factors of genome dynamics (mutation, horizontal gene transfer with or without recombination) result together in strain diversification? (iv) Is genome dynamics in conflict with genome organisation? (v) How does the commensalism/pathogenicity duality evolve? 10.1371/journal.pgen.1000344.t001 Table 1 Principal characteristics of the 20 Escherichia coli/Shigella strains and 1 E. fergusonii strain. Strains Host Sample Clinical condition (Pathotypea) Phylogenetic groupb Extraintestinal mouse model phenotypec (Number of mice killed out of 10) Genome sequence reference K-12 MG1655 Human Faeces Commensal A NK (0) [115] K- 12 W3110 Human Faeces Commensal A NK (0) Nara Institute of Science and Technology IAI1 Human Faeces Commensal B1 NK (0) This work 55989 Human Faeces Diarrhoea (EAEC) B1 K (10) This work S. boydii 4 227 (Sb 227) Human Faeces Shigellosis S1 NDd [116] S. sonnei 046 (Ss 046) Human Faeces Shigellosis SS ND [116] S. flexneri 2a 301 (Sf 301) Human Faeces Shigellosis S3 ND [117] S. flexneri 2a 2457T (Sf 2457T) Human Faeces Shigellosis S3 NK (0) [118] S. flexneri 5b 8401 (Sf 8401) Human Faeces Shigellosis S3 ND [119] S. dysenteriae 1 197 (Sd 197) Human Faeces Shigellosis SD1 ND [116] O157:H7 EDL933 Human Faeces Diarrhoea (EHEC) E NK (1) [120] O157:H7 Sakai Human Faeces Diarrhoea (EHEC) E NK (1) [121] UMN026 Human Urine Cystitis (ExPEC) D K (10) This work IAI39 Human Urine Pyeloneprhitis (ExPEC) D K (8) This work UTI89 Human Urine Cystitis (ExPEC) B2 K (10) [122] APEC O1 Chicken Lung Colisepticemia (ExPEC) B2 K (10) [123] S88 Human Cerebro-spinal fluid New born meningitis (ExPEC) B2 K (10) This work CFT073 Human Blood Pyeloneprhitis (ExPEC) B2 K (10) [30] ED1A Human Faeces Healthy subject B2 NK (0) This work 536 Human Urine Pyeloneprhitis (ExPEC) B2 K (10) [124] E. fergusonii Human Faeces Unknown Outgroup NK (1) This work The strains in bold correspond to the strains sequenced in this work. a EAEC (enteroaggregative E. coli), EHEC (enterohaemorrhagic E. coli), ExPEC (extraintestinal pathogenic E. coli). b The E. coli and Shigella phylogenetic groups are as defined in [22] and [6], respectively. c K, killer; NK, Non Killer [32]. d ND, not determined. Results/Discussion The General Features of the Seven Sequenced Genomes We fully sequenced the chromosomes and the plasmids, if any, of 6 strains of E. coli and the reference type strain of E. fergusonii. The general features of these replicons are listed in Tables 2 and 3. Genomes were sequenced at an average of 12-fold coverage and were then finished. The 6 newly sequenced E. coli chromosomes contain between 4.7 Mb and 5.2 Mb each, corresponding to between 4627 and 5129 protein coding genes, slightly above the average value within the 20 genomes that we analyzed (∼4700 genes, ranging from 4068 to 5379). The chromosome of E. fergusonii is slightly smaller with ∼4.6 Mb and ∼4500 protein coding genes. The G+C content is very similar among the 6 strains and close to the E. coli K-12 MG1655 value (∼50.8%). The G+C content of E. fergusonii is lower at 49.9%. These chromosomes have similar densities of coding genes and numbers of stable RNA genes. By contrast, the number of pseudogenes varies more widely, from 22 in E. fergusonii to 95 in strain ED1a (Table 2). The list of pseudogenes is available in Table S1. 10.1371/journal.pgen.1000344.t002 Table 2 General features of the Escherichia coli and E. fergusonii genomes sequenced in this work with E. coli K-12 MG1655 as reference (chromosome features). Chromosome features E. coli K-12 MG1655 E. coli strains E. fergusonii ATCC 55989 IAI1 ED1a S88 IAI39 UMN026 Genome Size (bp) 4 639 675 5 154 862 4 700 560 5 209 548 5 032 268 5 132 068 5 202 090 4 588 711 G+C content (%) 50.8 50.7 50.8 50.7 50.7 50.6 50.7 49.9 rRNA operons 7 (+5S) 7 (+5S) 7 (+5S) 7 (+5S) 7 (+5S) 7 (+5S) 7 (+5S) 7 (+5S) tRNA genes 86 94 86 91 91 88 88 87 Total Protein-coding genesa 4306 4969 4491 5129 4859 4906 4918 4336 Pseudogenesb (nb) 81 79 51 95 90 80 45 22 Protein coding densityc 85.7 87.4 87.6 86.2 87 86.1 87.8 84.7 Assigned functiond (%) 80 74 77 74 77 78 76.5 77 Conserved hypothetical (%) 12.5 23 21.5 23 22 20 22 20 Orphans (%) 7.5 3 1.5 3 1 2 1.5 3 IS-like genes (nb) 66 150 42 118 47 224 92 29 Phage-associated genes (nb) 231 406 201 657 507 393 429 235 a The number of protein-coding genes is given without the number of coding sequences annotated as artifactual genes (Supplementary Table 2A). b The number of pseudogenes computed for each genome corresponds to the real number of genes that are pseudogenes: one pseudogene can be made of only one CDS (in this case the gene is partial compared to the wild type form in other E. coli strains) or of several CDSs (generally two or three CDSs corresponding to the different fragments of the wild type form in other E. coli strains). These lists of pseudogenes are available in Supplementary Table 1. c The computed protein coding density takes into account the total length of protein genes excluding overlaps between genes, artifacts, and RNA genes. d Protein genes with assigned function include the total number of definitive and putative functional assignments. 10.1371/journal.pgen.1000344.t003 Table 3 General features of the Escherichia coli and E. fergusonii genomes sequenced in this work with E. coli K-12 MG1655 as reference (plasmid features). Plasmid features E. coli strains E. fergusonii ATCC 55989 ED1a S88 UMN026 Genome Size (bp) 72 482 119 594 133 853 122 301 33 809 55 150 G+C content (%) 46.1 49.2 49.3 50.5 42 48.5 Total Protein-coding genesa 100 150 144 149 49 54 Pseudogenesb (nb) 7 11 9 8 0 5 Protein coding densityc 75.6 86.2 87 79.4 87.5 88.7 Assigned functiond (%) 74 53 65 65.7 35.4 46.6 Orphans (%) 17 31.5 25.8 27.8 12.5 20.7 Hypothetical (%) 9 15.5 9.2 6.5 52.2 32.7 IS-like genes (nb) 18 14 14 15 0 4 a The number of protein-coding genes is given without the number of coding sequences annotated as artifactual genes (Supplementary Table 2A). b The number of pseudogenes computed for each genome corresponds to the real number of genes that are pseudogenes: one pseudogene can be made of only one CDS (in this case the gene is partial compared to the wild type form in other E. coli strains) or of several CDSs (generally two or three CDSs corresponding to the different fragments of the wild type form in other E. coli strains). These lists of pseudogenes are available in Supplementary Table 1. c The computed protein coding density takes into account the total length of protein genes excluding overlaps between genes, artifacts, and RNA genes. d Protein genes with assigned function include the total number of definitive and putative functional assignments. The variation in the number of pseudogenes is uncorrelated with the number of transposable elements and phage-associated genes, which vary in the range 42–224 and 201–517 respectively. While some phage-associated genes are scattered throughout the chromosomes, the majority are concentrated in well-defined prophage regions. Analyses of the prophages suggest that many may still be functional. These prophages often carry at their extremity some unrelated cargo genes that probably arose from genomes of previously infected bacteria, as found in Salmonella [33]. We sequenced a total of 6 plasmids, varying in size from 34 to 134 kbp: four strains possess one plasmid each whereas one strain has 2 plasmids (Table 3). As frequently noted, the plasmids have a lower gene density (84%, vs. 87% for chromosomes), lower G+C content (47.4%, vs. 50.7% for chromosomes) and more pseudogenes (2.7%, vs. 1.5% for chromosomes). The percentage of orphan proteins (i.e., having no detectable homolog in other organisms) is also high on plasmids (6.5 to 52.2%), while it ranges between 1–3% on the chromosomes. A manual expert annotation of the new E. coli strains was performed on genes and regions not found in E. coli K-12 MG1655 (about 10 000 genes in total; Table S2A). This allowed the re-annotation of orthologs in the previously available Escherichia and Shigella genomes (see Materials and Methods). The annotation data, together with the results of the comparative analysis were stored in a relational database called ColiScope, which is publicly available using the MaGe Web-based interface at http://www.genoscope.cns.fr/agc/mage. This re-annotation process revealed extensive variations in the number of the newly predicted genes (Table S2B). For example, between the two strains of E. coli O157:H7 we found twice as many newly predicted genes in one strain as in the other. In some genomes important genes were missing. For example, in E. coli APEC O1 several subunits of the ribosome, DNA polymerase III, and ATP synthase were missing in the original annotation (Table S3, E. coli APEC sheet). In other genomes, the re-annotation allowed us to standardise the definition and identification of pseudogenes. For example, in S. sonnei Ss 046 most of the newly annotated genes correspond to insertion sequences (ISs) and small fragments of incompletely annotated pseudogenes (Table S3, S. sonnei sheet). As a result of this effort, the present ColiScope database contains a complete and consistent set of annotations for the 7 newly sequenced genomes and the 14 available Escherichia and Shigella genomes. These data were the starting point of the work presented here. We analyzed gene order conservation within the 21 genomes (Table S4). More than half of the genomes have exactly the gene order of E. coli K-12 MG1655, which we inferred as ancestral. Thus, the organisation of the core genome is stable in most strains. Three genomes show 1 or 2 rearrangements. Seven genomes show more than 10 blocks of synteny: 6 of these genomes are from Shigella, the high rearrangement rates of which resulted in up to 65 blocks of synteny in S. dysenteriae. These genomes have a large number of ISs, ranging from 549 to 1155 in S. flexneri and S. dysenteriae, respectively, which are well known to shuffle genomes. E. fergusonii also shows a large number of rearrangements relative to the ancestral organization of the E. coli genome. Since the organisation of some strains of the more distantly related Salmonella enterica closely resembles that of E. coli K-12 MG1655, many rearrangements must have taken place in the branch leading to E. fergusonii. Figure S1 provides the classical concentric circle representation for the 7 genomes we sequenced, showing GC skews, G+C variation, and a description of the presence of genes in ever-increasing clades within the genus, relative to the inferred ancestral genome. The first position of the sequences was chosen to match the orthologous region in the E. coli K-12 MG1655 genome and corresponds to the intergenic region between lasT and thrL. Origins and termini of replication were identified by GC skews and homology with the respective E. coli K-12 MG1655 regions. These figures show that divergence from the average G+C content often occurs in genomic regions absent in the other strains. They also reveal the highly mosaic structure of these genomes, comprising the core genes and the accessory genes, which we then set out to quantify. The Core and Pan-Genomes of E. coli The analysis of the first E. coli genomes changed our views about the evolution of gene repertoires in bacteria. Genomes within the species vary in size by more than 1 Mb, i.e., by more than 1000 genes, and even the gene repertoires of similarly sized genomes differ widely [30],[34]. We have thus taken advantage of the unprecedented availability of 20 completely sequenced genomes of the same species to analyse the evolution of the gene repertoire. We first identified the core and pan-genomes of E. coli, i.e., the genes present in all genomes and the full set of non-orthologous genes among all genomes. In our data set, the average E. coli genome contains 4721 genes, the core genome contains 1976 genes, and the pan-genome contains 17 838 genes. The random sampling of one gene within a randomly selected E. coli genome has a probability of only ∼42% of revealing a ubiquitous gene. On the other hand, the full sequencing of an E. coli strain allows observation of only one-fourth of the observed pan-genome. This implies that although some fundamental functions can be well studied by using a model strain, no single strain can be regarded as highly representative of the species. Further sampling of E. coli genomes is unlikely to change significantly the estimate of the core genome, however, the pan-genome is far from being fully uncovered (Figure 1). Annotation and sequencing artefacts may affect the estimations of core and pan-genome sizes, e.g. by spurious annotation of small genes or pseudogenes. We hope to have minimised such problems by using a coherent set of annotations. Still, we found that 40 genes deemed essential in E. coli K-12 W3110 [35] were missing in the core genome. Among these, 17 correspond to genes with conflicting reports of essentiality, or contextually essential genes such as prophage repressors, and are absent in most genomes. The other 23 genes have orthologs in most genomes and 19 are missing in a single genome where they can be found as pseudogenes interrupted by a single-nucleotide frameshift. While “pseudogenisation” does often start with such frameshifts [36], these genes correspond to core housekeeping functions, so the reported frameshifts probably represent sequencing errors. For example, it is hard to see how S. boydii could replicate without the catalytic α-subunit of the DNA polymerase III or how E. coli 536 could survive without a tyrosine tRNA synthetase. We found some comfort in verifying that none of the 23 genes was absent from the 7 genomes we sequenced. If one assumes that these essential genes cannot be deleted and that no special care has been taken to check for sequencing errors at these loci, then our estimation of the core genome should be increased by a factor of 260/(260-23) to 2167 genes. This still makes the core genome less than half of the average E. coli genome (∼46%). Importantly, no gene of the core genome, nor any operon ubiquitous in E. coli, was unique to the species, i.e., we could always find a homolog in at least one of the other fully sequenced bacterial genomes. 10.1371/journal.pgen.1000344.g001 Figure 1 Escherichia coli core and pan-genome evolution according to the number of sequenced genomes. Number of genes in common (left) and total number of non-orthologous genes (right) for a given number of genomes analysed for the different strains of E. coli. The upper and lower edges of the boxes indicate the first quartile (25th percentile of the data) and third quartile (75th percentile), respectively, of 1000 random different input orders of the genomes. The central horizontal line indicates the sample median (50th percentile). The central vertical lines extend from each box as far as the data extend, to a distance of at most 1.5 interquartile ranges (i.e., the distance between the first and third quartile values). At 20 sequenced genomes, the core-genome had 1976 genes (11% of the pan-genome), whereas the pan-genome had (i) 17 838 total genes (black), (ii) 11 432 genes (red) with no strong relation of homology ( 0.05), suggesting that our inference is unbiased with respect to successive events taking place at the same locations in long branches or by selection-purging older events in internal branches. Variation in gene repertoires has been described as being scattered on the chromosome of E. coli and balanced between the two replichores [61]. For the numerous small insertions and deletions this distribution results naturally from random insertion/deletion of genetic material. Such small indels are expected to have little impact on the large-scale organisation of the genome. What about the very large insertions/deletions? The 554 such events that involve more than 10 genes over-represent insertions over deletions (Fisher exact test, p 0.05). This shows that insertion/deletion hotspots are also rearrangement hotspots, even though we initially removed rearranged positions to identify the insertion/deletion hotspots (thus being conservative). It also suggests that rearrangements occur in these regions because they are permissive to change not because they are intrinsically recombinogenic, since the frequency with which they rearrange simply reflects their larger size. However, even if hotspots are not intrinsically recombinogenic they can still be caused by the brokering effect of homologous recombination. Indeed, incoming DNA once integrated in one genome can propagate within the population by lateral transfer via classical homologous recombination involving the homologous flanking regions. Given the observed rates of recombination in the species, this mechanism could quickly lead to the horizontal spread of highly adaptive newly acquired genes. We describe some evidence for this in the next section. Hotspots of Phylogenetic Incongruence For any given sequence alignment, the likelihood of the overall gene tree topology, i.e., the phylogenetic congruence, reflects the extent to which the phylogenetic signal of the sequences was altered by recombination. While the concatenate of genes provides a strong phylogenetic signal, the individual genes' histories can be very diverse as a result of recombination. Furthermore, these histories may depend on the genes' positioning in the chromosome. Notably, if homologous recombination helps in disseminating recent acquisitions, as we propose, the core genome around these hotspots should show signs of recombination as indicated by phylogenetic incongruence. We therefore made an analysis in 5 kbp sliding windows along the multiple genome alignment to identify the most phylogenetically incongruent regions (see Material and Methods). This method identified two large regions of very strong incongruence, one centred around rfb (Figure S8), the operon involved in O antigen synthesis, and the other around the leuX tRNA gene, and including fimA, which is under diversifying selection and is involved in the adhesion of bacteria to host cells [66]. Both loci were previously identified as hotspots of phylogenetic incongruence [67],[68]; the present analysis reveals how much they affect the chromosome. Recombination at the rfb locus significantly affects congruence within a striking 150 kbp surrounding region, i.e., from positions 1988 kbp to 2138 kbp (100% of windows tested had scores lower than 1.96 standard deviation away from the average, with an average of −4.84 and peaks at −10.19). The fim locus includes an incongruence region close to 200 kbp in length (from positions 4421 kbp to 4618 kbp, average −2.54 standard deviation and 73% with lower than −1.96 standard deviation and peaks at −6.65). Interestingly those two regions are centered on integration hotspots and encompass 11 of the 133 hotspots of integration. The genes present in such loci arose most likely by lateral transfer since they are highly dissimilar between strains. For example, genes at the rfb locus genes can exhibit less than 50% similarity, while the leuX locus encompasses a highly variable assortment of non-homologous inserts in all the genomes sequenced. Hence at least for those two major loci we find a striking link between hotspots of integration and hotspots of homologous recombination. In the case of the rfb locus, it is worth noting that the incongruence signal we observe might be a composite signal, due not only to rfb but also to neighbouring loci. Within the above defined rfb region of incongruence, a flagella locus (fli operon) associated with two hotspots of integration is also under diversifying selection. Moreover, the high pathogenicity island (HPI) is integrated within that high recombination region in many isolates and corresponds also to a hotspot of integration. It has been suggested that after a recent and unique integration event, the HPI has propagated within the species by homologous recombination [69]. The propagation or diversification of these loci, located to the left of rfb, through homologous recombination might generate the asymmetrical pattern of phylogenetic incongruence we observe around the rfb locus (extended incongruence on the left side of the rfb locus) (Figure S8). We found 23 other regions with weaker signatures of incongruence (i.e., with a 5 kbp sequence incongruence score more than 2 standard deviations from the average), each spanning less than 20 kbp. It is important to note that most of these incongruent regions include genes involved in diversification of genetic information and often pathogenicity. The vast majority of these include 3 groups of common genes. First are regions with the porin-encoding genes ompA and ompC, the flagella-encoding genes, the rfa locus coding for the core lipopolysaccaride and genes coding for several membrane proteins such as LolCDE, CcmABCDE, ABC transporter, AroP APC transporter, LplT-aas, FadK, YeaY, EamB, YhgE and YicG membrane proteins. These loci are probably involved in diversifying selection since they code for antigenic proteins exposed at the cell surface. Second are two regions encompassing mismatch repair genes (mutS and mutH) that have been shown to be under selection for cycles of inactivation and reacquisition through recombination [70]. Third is a region associated with the integration of a locus that can provide resistance to phages through clustered regulatory interspaced short palindromic repeats (CRISPRs) [71]. All available methods estimate the effective, not the intrinsic, recombination rate. Effective recombination results from the intrinsic recombination rate and the ensuing selection on recombinants. Most of the phylogenetic incongruence hotspots we found contain genes under diversifying selection, for instance to escape immune pressure or to acquire resistance to phage. Hence, it is very likely that differences in the intensity of selection might be responsible for the differences observed in the size of the regions affected by a phylogenetic incongruence hotspot. A recombinant carrying a new allele at a locus under strong diversifying selection will be selected and thus will increase rapidly in frequency in populations. Hence, the recombinant will invade the local population before any further recombination occurs at the locus [72]. In that case, sampling the genome after the action of natural selection allows identification of the original recombining fragment. In contrast, if selection is moderate, the recombining fragment that brought the interesting allele into the genome will be covered by many further recombination events before it reaches high frequency. In this case, only fragments around the selected allele will retain the trace of the recombination event. As a consequence, when selection is intense, one expects to identify long recombinant fragments in some strains, as we did at the rfb or leuX loci. Our observations suggest that the intensity of diversifying selection acting on the rfb and leuX-fimH loci are under very strong selective pressure compared with the diversifying selection acting on the core LPS, the flagella or some of the porins. The fact that most hotspots of integration (117 among 133) do not result in hotspots of phylogenetic incongruence suggests that they carry neutral or deleterious genes. Conversely, it also suggests that some horizontally acquired genes can be highly beneficial (e.g., 11 hotspots of phylogenetic incongruence around the rfb or leuX-fimH locus) or moderately beneficial (e.g., 4 hotspots of integration associated with hotspots of phylogenetic incongruence) and that this results in different selection footprints in the neighbouring core genome. Recombination and Chromosome Organisation The existence of integration and phylogenetic incongruence hotspots brings to the fore the conflict between genome dynamics and organisation. We therefore analysed the variation in recombination along the backbone sequence (as estimated by a population genetic-based approach), using a sliding window of 3 kbp on the multiple genome alignment and a step size of 500 bp. This analysis revealed a large region around the terminus of replication with a particularly low ratio between gene conversion and mutation rates (Cgc/theta) (Figure 10). The region between 1 Mb and 2 Mb shows lower gene conversion rates, since there is 20% lower chance for a base to be involved in a gene conversion event (Cgc×Lgc, unilateral t test: p = 1e-21). This region also shows 10% lower levels of polymorphism (theta of Watterson, p = 1e-7), i.e., variations within the E. coli species, and 2% lower G+C content (Figure 10). A+T richness at the terminus region has been suggested to result from higher mutation rates [73]. Based on comparative genomics with Salmonella, it was also shown that divergence, i.e., the genetic distance between species, slightly increased closer to the terminus [74],[75], further supporting the hypothesis of a higher local mutation rate. Using our newly sequenced outgroup genome E. fergusonii, which unlike Salmonella does not shows saturation of synonymous substitutions, we found that the terminus domain has synonymous and non-synonymous substitution rates twice as high as the rest of the chromosome. While decreased G+C content and increased divergence could reflect a higher mutation rate at the terminus, such an interpretation is contradicted by the observed lower polymorphism. 10.1371/journal.pgen.1000344.g010 Figure 10 Standardized cumulative sum of effective gene conversion rate and G+C content. Gene conversion rate (i.e., probability of being involved in a gene conversion event Cgc.Lgc) is shown in blue, and G+C content in red. A decrease in the cumulative sum reflects regions of lower-than-expected values of the statistics. Around the terminus domain, we found a decrease in both recombination and G+C content. Coloured boxes represent the 4 different organisation macrodomains (Right, Ter, Left, Ori). The arrows point towards the origin and terminus of replication. Theoretical population genetic studies have shown that the fluctuation of recombination frequency along chromosomes affects the level of polymorphism and the efficiency of selection [76]. When there are numerous deleterious mutations and low recombination rates, a fraction of the population bearing deleterious alleles is doomed to disappear in the long term without contributing to the gene pool of the future population. The relevance of this phenomenon, referred to as background selection, requires the existence of deleterious mutations of moderate effects, i.e., mutations that can persist for some time in the population before selection wipes them out. At the population level, this result in an excess of rare alleles, which can be estimated by Tajima's D statistics. We found that overall a gene's average Tajima's D was slightly negative (indicating an excess of rare alleles). However, the Tajima's D of synonymous mutations was null, while that of non-synonymous mutations was much more negative (Figure S9). This suggests that most non-synonymous mutations are deleterious since, in contrast to synonymous mutations, they do not increase in frequency within the population reflecting the purging effect of natural selection. Therefore the conditions for the action of background selection are met. Furthermore, under background selection, a reduced recombination rate results in a decreased polymorphism (such as we observed around the terminus), an increased fraction of rare alleles and a decreased efficiency of selection [76]. The terminus region shows a lower Tajima's D than the rest of the chromosome (Student bilateral test, p 8 mice killed) [32]. Supporting Information Figure S1 Circular representation of the six Escherichia coli genomes and the E. fergusonii genome. Circles display from the inside out: (1) GC skew (G+C/G−C using a 1 kbp sliding window). (2) Location of tRNA genes, rRNA operons and Insertion Sequences (ISs). (3) GC deviation (difference between the mean GC content in a 1 kbp window and the overall mean GC). Red areas indicate that the deviation is greater than 2 standard deviations. (4) Ancestral E. coli genome. Yellow areas denote genes that are present in all the genomes under study. (5) Scale. (6) Gene specificity at strain level. Genes sharing at least one homolog in an other E. coli strain of the same phylogenetic group and having more than 85% identity over at least 80% of their length were regarded as non specific. To simplify the visualisation of specific regions, we created a colour gradient that denotes the percentage of organisms that possess a homolog of a given gene within the reference genome. If this particular gene is present in all the organisms under study, it is tagged in light grey. Conversely, if it is present only in the reference genome, it is tagged in dark colour. In other words, the more pronounced the colour, the higher the specificity. (7) Gene specificity at group level. The same criteria were used as for circle (6) but the genome analysed is compared to E. coli strains that belong to other phylogenetic groups. The comparison includes Shigella as well. (8) Gene specificity at the species level. The same protocol was used as for circles (6) and (7) except the comparison involves E. fergusonii which is considered as the outgroup for this study. (1.82 MB PPT) Click here for additional data file. Figure S2 Visual representation of MAUVE multiple alignment of 20 Escherichia coli/Shigella genomes. The representation was performed using the MOSAIC database (http://genome.jouy.inra.fr/mosaic/) multiple alignment viewer. Horizontal lines correspond to a linear representation of each genome sequence drawn to scale. The blue line corresponds to annotated genes. (At this scale only a unique line is visible.) Coloured blocks correspond to the locally collinear blocks (LCBs) of the alignment as defined by MAUVE. LCBs corresponding to inversions are represented on a second line. An LCB in one genome is linked to the corresponding LCB on the subsequent genome with a plot of the same colour. This visual representation shows that, apart from the rearrangements present in Shigella chromosomes, E. coli genomes are mostly collinear. (0.11 MB PPT) Click here for additional data file. Figure S3 Phylogenetic tree of the backbone of the 20 Escherichia coli and Shigella strains as reconstructed by MAUVE software. This unrooted tree was built using Tree-puzzle with the HKY+gamma (with 8 categories)+I model followed by BioNJ to reconstruct the tree from the distance matrix. The values at the nodes correspond to support values for each internal branch, as estimated by Tree-puzzle (range 0–100), and can be interpreted in much the same way as bootstrap values. (0.09 MB PPT) Click here for additional data file. Figure S4 Association between gene repertoire relatedness and phylogenetic distance. A. Genomes were binned according to phylogenetic distance for clarity. For the first two bins, which correspond to the most closely related genes, there is a high percentage of genome in common, which is not the case for the other bins, which correspond with more distantly related genes. B. Histogram of the phylogenetic distances between pairs of genomes. (0.04 MB PPT) Click here for additional data file. Figure S5 Reconstruction of gains and losses of genes in the evolution of Escherichia coli. The cladogram shows the phylogenetic relationships between the 20 E. coli/Shigella genomes rooted on the E. fergusonii genome, as in Figure 4, with branch lengths ignored for clarity. Each strain and internal node of the tree is labelled with the inferred numbers of genes gained (red: top) and lost (black: top), and the inferred numbers of corresponding events of gene acquisition (red: bottom) and loss (black: bottom) along the branch. Pie charts on each branch indicate the functional classification of genes lost, using the colour-scale (details in the keys). The functional classes of known-function genes are represented by numbers explained by a key in Supplementary Table 4. (0.31 MB PPT) Click here for additional data file. Figure S6 Association between the distance of a node to the tip of the tree and the difference between the predicted ancestral genome size and the effective number of genes reliably predicted to be present in the node. The association is highly significant (R2 = 0.56, p<0.001). (0.03 MB PPT) Click here for additional data file. Figure S7 Characteristics of hotspots of insertion/deletion of genetic material. The circles indicate the values per location between contiguous genes in the core genome. Data are (from the outside circle inwards): average number of genes, standard deviation, sum of genes, number of prophage-like elements, number of insertion sequence like elements, sum of tRNA genes and heterogeneity rate at hotspots. The latter is the ratio between the observed number of orthologs and the expected value if all genes had orthologs in all genomes, after excluding genomes lacking genes at the hotspot and those for which the region has a synteny breakpoint. (0.14 MB PPT) Click here for additional data file. Figure S8 Phylogenetic congruence at the rfb locus coding for O antigen. We followed the likelihood of the species topology for 5 kbp windows (spaced by 250 bp) along the chromosomal backbone. After correcting for the number of polymorphic sites, each window received a Z score of phylogenetic congruence. Low values reflect lower than average phylogenetic congruence. A large region (green arrow) has a significantly lower congruence than the rest of the genome. The red arrows indicate the hotspots of integration and the corresponding loci when identified. HPI: high pathogenicity island. (0.03 MB PPT) Click here for additional data file. Figure S9 Distribution of Tajima's D statistics on the 1976 Escherichia coli core genome genes. The colour code is as follows: all mutations (red), synonymous mutations (green) and non-synonymous mutations (yellow). Negative Tajima's D values [126] reflect a higher than expected frequency of rare alleles. The more negative value of Tajima's D for non-synonymous mutations suggests that they are on average deleterious: they persist some time in populations before selection removes them. (0.08 MB PPT) Click here for additional data file. Table S1 Pseudogenes found in Escherichia coli K-12 MG1655 and the 7 newly sequenced genomes of the ColiScope project. The ‘Reference’ column give the coding sequence (CDS) identifier of the wild type form of the gene, in one of the 21 analyzed Escherichia and Shigella bacterial genomes. For each of these 21 genomes the particular gene's status is indicated as functional (‘1’), absent (‘0’) or a pseudogene (‘−1’). Gene names in boldface correspond to genes that are pseudogenes only in the considered strain. (0.25 MB XLS) Click here for additional data file. Table S2 A) Number of predicted protein encoding genes in the genomes of the newly sequenced strains of Escherichia coli and E. fergusonii. Genes were (a) functionality annotated using automatic annotation transfer from K-12 MG1655 orthologs or other ColiScope manually annotated orthologous genes, (b) manually annotated using the MaGe web-based graphical interface, or (c) considered as false positive gene predictions. B) Publicly available Escherichia and Shigella genomes included in the ColiScope database. (a) Inaccurate (‘Wrong’ status) or missed gene annotations (‘New’ status) have been found using our MICheck procedure. For the 14 analyzed genomes, the list of newly predicted genes is given in Supplementary Table 3. (b) Automatic functional annotation transfer between orthologous genes (85% identity over at least 80% of the length of the smallest protein) began with similarity results obtained with E. coli K-12 MG1655, then with the new genomes of the ColiScope project. False gene predictions (i.e., artefacts) were those defined in the course of the expert annotation of the ColiScope sequences. (c) ‘Specific genes’ are genes that have no ortholog in E. coli K-12 MG1655 or any of the newly sequenced and annotated genomes. (0.05 MB DOC) Click here for additional data file. Table S3 Missing genes in publicly available Escherichia coli and Shigella genomes. The genes are ordered by length (given in base pairs). Those that are similar to genes from the minimal gene set defined by [127] are highlighted in boldface. Functional descriptions of the genes, starting with ‘fragment of’ (Product column), are provided for putative pseudogenes (whether actual pseudogenes or sequencing errors). For some E. coli stains, e.g., UTI89, the corresponding pseudogenes were probably correctly annotated by the authors (numbering of the gene locus_tags), but were not reported in the databank files (GeneBank and EMsBL), and thus were annotated as missing genes by the MICheck procedure [97]. (0.36 MB XLS) Click here for additional data file. Table S4 Synteny blocks and insertion sequence (IS) elements among the 21 Escherichia coli/Shigella/E. fergusonii genomes. (0.03 MB DOC) Click here for additional data file. Table S5 Classification of bioprocesses (key for Figure 7). The tests indicate the sense of the difference in the number of genes associated with a given bioprocess. ‘+/−’ means more/fewer genes in the first class, i.e., more/fewer in the core genome than in the complementary set. ‘++/−−’ means the difference is significant at the 5% level, using a Chi square test followed by a sequential Bonferroni correction for multiple tests. (0.02 MB DOC) Click here for additional data file. Table S6 Genes (and associated characteristics) categorically associated with certain phylogenetic groups or pathotypes. Principal characteristics of the genes were deduced from the annotation process. (0.23 MB DOC) Click here for additional data file. Table S7 Genes of the genomic island at the pheV tRNA insertion hot spot (see Figure 9). (0.08 MB XLS) Click here for additional data file.
              Bookmark
              • Record: found
              • Abstract: found
              • Article: not found

              Extensive mosaic structure revealed by the complete genome sequence of uropathogenic Escherichia coli.

              We present the complete genome sequence of uropathogenic Escherichia coli, strain CFT073. A three-way genome comparison of the CFT073, enterohemorrhagic E. coli EDL933, and laboratory strain MG1655 reveals that, amazingly, only 39.2% of their combined (nonredundant) set of proteins actually are common to all three strains. The pathogen genomes are as different from each other as each pathogen is from the benign strain. The difference in disease potential between O157:H7 and CFT073 is reflected in the absence of genes for type III secretion system or phage- and plasmid-encoded toxins found in some classes of diarrheagenic E. coli. The CFT073 genome is particularly rich in genes that encode potential fimbrial adhesins, autotransporters, iron-sequestration systems, and phase-switch recombinases. Striking differences exist between the large pathogenicity islands of CFT073 and two other well-studied uropathogenic E. coli strains, J96 and 536. Comparisons indicate that extraintestinal pathogenic E. coli arose independently from multiple clonal lineages. The different E. coli pathotypes have maintained a remarkable synteny of common, vertically evolved genes, whereas many islands interrupting this common backbone have been acquired by different horizontal transfer events in each strain.
                Bookmark

                Author and article information

                Journal
                mBio
                mBio
                American Society for Microbiology
                2150-7511
                July 01 2014
                August 29 2014
                August 05 2014
                August 05 2014
                : 5
                : 4
                Article
                10.1128/mBio.01075-14
                ed411693-aef5-4e57-aad4-40816ac3b295
                © 2014
                History

                Comments

                Comment on this article