Improving pan-genome annotation using whole genome multiple alignment

There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

Abstract

Background

Rapid annotation and comparisons of genomes from multiple isolates (pan-genomes) is becoming commonplace due to advances in sequencing technology. Genome annotations can contain inconsistencies and errors that hinder comparative analysis even within a single species. Tools are needed to compare and improve annotation quality across sets of closely related genomes.

Results

We introduce a new tool, Mugsy-Annotator, that identifies orthologs and evaluates annotation quality in prokaryotic genomes using whole genome multiple alignment. Mugsy-Annotator identifies anomalies in annotated gene structures, including inconsistently located translation initiation sites and disrupted genes due to draft genome sequencing or pseudogenes. An evaluation of species pan-genomes using the tool indicates that such anomalies are common, especially at translation initiation sites. Mugsy-Annotator reports alternate annotations that improve consistency and are candidates for further review.

Conclusions

Whole genome multiple alignment can be used to efficiently identify orthologs and annotation problem areas in a bacterial pan-genome. Comparisons of annotated gene structures within a species may show more variation than is actually present in the genome, indicating errors in genome annotation. Our new tool Mugsy-Annotator assists re-annotation efforts by highlighting edits that improve annotation consistency.

Related collections

Most cited references 22

Record: found
Abstract: found
Article: not found

Organised Genome Dynamics in the Escherichia coli Species Results in Highly Diverse Adaptive Paths

Marie Touchon, Claire Hoede, Olivier Tenaillon … (2009)

Introduction Escherichia coli was brought into laboratories almost a century ago to become one of the most important model organisms and by far the best-studied prokaryote. Major findings in phage genetics, bacterial conjugation, recombination, genetic regulation and chromosome replication involved the use of E. coli, especially laboratory derivatives of the K-12 strain, originally isolated from the faeces of a convalescent diphtheria patient in Palo Alto in 1922 [1]. However, K-12 derivatives are far from representing the whole E. coli species [2]. The primary habitat of E. coli is the lower intestinal tract of humans and other vertebrates, with which it typically establishes commensal associations. Healthy humans typically carry more than a billion E. coli cells in their intestine. It has been estimated that half of the living E. coli cells are outside their host, in their secondary habitat [3]. Beside these habitats, certain strains have the potential to cause a wide spectrum of intestinal and extra-intestinal diseases such as urinary tract infection, septicaemia, meningitis, and pneumonia in humans and animals [4]. Furthermore, Shigella, which have been elevated to the genus order with four species (dysenteriae, flexneri, boydii, sonnei) based on their capacity to generate a specific mucosal invasive diarrhoea strictly in humans and their biochemical characteristics, in fact belong to the E. coli species [5]–[7]. Of note, Shigella and enteroinvasive E. coli are considered the only obligate pathogens of the species, whereas other strains are facultative pathogens with a broad host range. Thus, natural isolates of E. coli/Shigella live in conditions quite different from those in the laboratory and must cope with very diverse environments that provide stresses ranging from immune system attack and protozoal grazing to starvation, low temperatures, and, more recently, antibiotic therapy. With its large range of pathologies, E. coli is a major cause of human morbidity and mortality around the world. Each year E. coli causes more than two million deaths due to infant diarrhoea [8],[9] and extraintestinal infections (mainly septicaemia derived from urinary tract infection) [10], and is also responsible for approximately 150 million cases of uncomplicated cystitis [10]. Since humans and food animals carry so many E. coli cells that may establish commensal or antagonistic interactions with their hosts it is mandatory to define the genetic and population determinants that drive commensal strains to adopt a pathogenic behaviour. Population genetic studies based on both multi-locus enzyme electrophoresis [11]–[13] and various DNA markers [14]–[18] have identified four major phylogenetic groups (A, B1, D and B2) and a potential fifth group (E) among E. coli strains. Strains of these groups differ in their phenotypic characteristics, including the ability to use certain sugars, antibiotic resistance profiles and growth rate–temperature relationships [19]. The distribution (presence/absence) of a range of virulence factors thought to be involved in the ability of a strain to cause diverse diseases also varies among strains of these phylogenetic groups [20]–[22], indicating a role of the genetic background in the expression of virulence [23]. Consequently, these groups are differently associated with certain ecological niches, life-history characteristics and propensity to cause disease. For example, group B2 and D strains are less frequently isolated from the environment [24], but more frequently recovered from extra-intestinal body sites [23]. While B2 strains represent 30 to 50% of the strains isolated from the faeces of healthy humans living in industrialised countries, they account for less than 5% in French Guyana Amerindians [25]–[26]. The clear clustering of E. coli strains into monophyletically meaningful groups has long been used as an argument favouring clonality within the species. However, analysis of gene sequences shows pervasive recombination, matching the well-known efficiency of conjugation and transduction of the species [17],[27]. Hence, it remains controversial whether such frequent recombination obliterates the phylogenetic signal. E. coli genomes show evidence of widespread acquisition of functions by horizontal gene transfer, concomitant with similar amounts of gene deletion [28]–[29]. While less than 3% of nucleotide divergence is found among conserved genes, the gene content between pairs of E. coli genomes may diverge by more than 30% [30]. Such diversification of gene content due to horizontal gene transfer contributes greatly to the diversity of the strains' phenotypes and can be accurately quantified only by the sequencing of a large number of strains to completion and closure. Until now, sequencing efforts in E. coli have been focused mainly on pathogenic strains, particularly on diarrhoeal and group B2 extraintestinal pathogenic strains (see Table 1), precluding an unbiased assessment of the diversity of the species. Therefore, we have sequenced with high coverage and up to completion the genomes of 6 human-source E. coli strains. The E. coli strains were chosen to complement the available sequences and other ongoing sequencing projects (http://msc.jcvi.org/e_coli_and_shigella/index.shtml, http://www.sanger.ac.uk/Projects/Escherichia_Shigella/). They encompass two commensal strains of phylogenetic groups B1 and B2, a group B1 enteroaggregative strain, two group D urinary tract infection strains and a group B2 newborn meningitis strain (Table 1). We also sequenced the type strain of the closest E. coli relative, i.e., E. fergusonii [31], as an outgroup to permit accurate and meaningful evolutionary analyses with the 6 new E. coli genomes and the 14 other currently available E. coli/Shigella genomes. To statistically substantiate the identification of extraintestinal virulence-associated genes, we also applied a mouse lethality assay to the strains [32] to quantify the intrinsic virulence of the strain, excluding host variability and other potential confounding factors (Table 1). Our goal was to take the outstanding opportunity provided by the availability of many genomes of a single bacterial species, regarding which a considerable amount of knowledge has been accumulated over the years, to answer to the following questions. (i) Is there genome-wide evidence of frequent recombination and does it vary with genome location? (ii) If so, can one nonetheless infer an intra-specific bacterial phylogeny? (iii) How do the different factors of genome dynamics (mutation, horizontal gene transfer with or without recombination) result together in strain diversification? (iv) Is genome dynamics in conflict with genome organisation? (v) How does the commensalism/pathogenicity duality evolve? 10.1371/journal.pgen.1000344.t001 Table 1 Principal characteristics of the 20 Escherichia coli/Shigella strains and 1 E. fergusonii strain. Strains Host Sample Clinical condition (Pathotypea) Phylogenetic groupb Extraintestinal mouse model phenotypec (Number of mice killed out of 10) Genome sequence reference K-12 MG1655 Human Faeces Commensal A NK (0) [115] K- 12 W3110 Human Faeces Commensal A NK (0) Nara Institute of Science and Technology IAI1 Human Faeces Commensal B1 NK (0) This work 55989 Human Faeces Diarrhoea (EAEC) B1 K (10) This work S. boydii 4 227 (Sb 227) Human Faeces Shigellosis S1 NDd [116] S. sonnei 046 (Ss 046) Human Faeces Shigellosis SS ND [116] S. flexneri 2a 301 (Sf 301) Human Faeces Shigellosis S3 ND [117] S. flexneri 2a 2457T (Sf 2457T) Human Faeces Shigellosis S3 NK (0) [118] S. flexneri 5b 8401 (Sf 8401) Human Faeces Shigellosis S3 ND [119] S. dysenteriae 1 197 (Sd 197) Human Faeces Shigellosis SD1 ND [116] O157:H7 EDL933 Human Faeces Diarrhoea (EHEC) E NK (1) [120] O157:H7 Sakai Human Faeces Diarrhoea (EHEC) E NK (1) [121] UMN026 Human Urine Cystitis (ExPEC) D K (10) This work IAI39 Human Urine Pyeloneprhitis (ExPEC) D K (8) This work UTI89 Human Urine Cystitis (ExPEC) B2 K (10) [122] APEC O1 Chicken Lung Colisepticemia (ExPEC) B2 K (10) [123] S88 Human Cerebro-spinal fluid New born meningitis (ExPEC) B2 K (10) This work CFT073 Human Blood Pyeloneprhitis (ExPEC) B2 K (10) [30] ED1A Human Faeces Healthy subject B2 NK (0) This work 536 Human Urine Pyeloneprhitis (ExPEC) B2 K (10) [124] E. fergusonii Human Faeces Unknown Outgroup NK (1) This work The strains in bold correspond to the strains sequenced in this work. a EAEC (enteroaggregative E. coli), EHEC (enterohaemorrhagic E. coli), ExPEC (extraintestinal pathogenic E. coli). b The E. coli and Shigella phylogenetic groups are as defined in [22] and [6], respectively. c K, killer; NK, Non Killer [32]. d ND, not determined. Results/Discussion The General Features of the Seven Sequenced Genomes We fully sequenced the chromosomes and the plasmids, if any, of 6 strains of E. coli and the reference type strain of E. fergusonii. The general features of these replicons are listed in Tables 2 and 3. Genomes were sequenced at an average of 12-fold coverage and were then finished. The 6 newly sequenced E. coli chromosomes contain between 4.7 Mb and 5.2 Mb each, corresponding to between 4627 and 5129 protein coding genes, slightly above the average value within the 20 genomes that we analyzed (∼4700 genes, ranging from 4068 to 5379). The chromosome of E. fergusonii is slightly smaller with ∼4.6 Mb and ∼4500 protein coding genes. The G+C content is very similar among the 6 strains and close to the E. coli K-12 MG1655 value (∼50.8%). The G+C content of E. fergusonii is lower at 49.9%. These chromosomes have similar densities of coding genes and numbers of stable RNA genes. By contrast, the number of pseudogenes varies more widely, from 22 in E. fergusonii to 95 in strain ED1a (Table 2). The list of pseudogenes is available in Table S1. 10.1371/journal.pgen.1000344.t002 Table 2 General features of the Escherichia coli and E. fergusonii genomes sequenced in this work with E. coli K-12 MG1655 as reference (chromosome features). Chromosome features E. coli K-12 MG1655 E. coli strains E. fergusonii ATCC 55989 IAI1 ED1a S88 IAI39 UMN026 Genome Size (bp) 4 639 675 5 154 862 4 700 560 5 209 548 5 032 268 5 132 068 5 202 090 4 588 711 G+C content (%) 50.8 50.7 50.8 50.7 50.7 50.6 50.7 49.9 rRNA operons 7 (+5S) 7 (+5S) 7 (+5S) 7 (+5S) 7 (+5S) 7 (+5S) 7 (+5S) 7 (+5S) tRNA genes 86 94 86 91 91 88 88 87 Total Protein-coding genesa 4306 4969 4491 5129 4859 4906 4918 4336 Pseudogenesb (nb) 81 79 51 95 90 80 45 22 Protein coding densityc 85.7 87.4 87.6 86.2 87 86.1 87.8 84.7 Assigned functiond (%) 80 74 77 74 77 78 76.5 77 Conserved hypothetical (%) 12.5 23 21.5 23 22 20 22 20 Orphans (%) 7.5 3 1.5 3 1 2 1.5 3 IS-like genes (nb) 66 150 42 118 47 224 92 29 Phage-associated genes (nb) 231 406 201 657 507 393 429 235 a The number of protein-coding genes is given without the number of coding sequences annotated as artifactual genes (Supplementary Table 2A). b The number of pseudogenes computed for each genome corresponds to the real number of genes that are pseudogenes: one pseudogene can be made of only one CDS (in this case the gene is partial compared to the wild type form in other E. coli strains) or of several CDSs (generally two or three CDSs corresponding to the different fragments of the wild type form in other E. coli strains). These lists of pseudogenes are available in Supplementary Table 1. c The computed protein coding density takes into account the total length of protein genes excluding overlaps between genes, artifacts, and RNA genes. d Protein genes with assigned function include the total number of definitive and putative functional assignments. 10.1371/journal.pgen.1000344.t003 Table 3 General features of the Escherichia coli and E. fergusonii genomes sequenced in this work with E. coli K-12 MG1655 as reference (plasmid features). Plasmid features E. coli strains E. fergusonii ATCC 55989 ED1a S88 UMN026 Genome Size (bp) 72 482 119 594 133 853 122 301 33 809 55 150 G+C content (%) 46.1 49.2 49.3 50.5 42 48.5 Total Protein-coding genesa 100 150 144 149 49 54 Pseudogenesb (nb) 7 11 9 8 0 5 Protein coding densityc 75.6 86.2 87 79.4 87.5 88.7 Assigned functiond (%) 74 53 65 65.7 35.4 46.6 Orphans (%) 17 31.5 25.8 27.8 12.5 20.7 Hypothetical (%) 9 15.5 9.2 6.5 52.2 32.7 IS-like genes (nb) 18 14 14 15 0 4 a The number of protein-coding genes is given without the number of coding sequences annotated as artifactual genes (Supplementary Table 2A). b The number of pseudogenes computed for each genome corresponds to the real number of genes that are pseudogenes: one pseudogene can be made of only one CDS (in this case the gene is partial compared to the wild type form in other E. coli strains) or of several CDSs (generally two or three CDSs corresponding to the different fragments of the wild type form in other E. coli strains). These lists of pseudogenes are available in Supplementary Table 1. c The computed protein coding density takes into account the total length of protein genes excluding overlaps between genes, artifacts, and RNA genes. d Protein genes with assigned function include the total number of definitive and putative functional assignments. The variation in the number of pseudogenes is uncorrelated with the number of transposable elements and phage-associated genes, which vary in the range 42–224 and 201–517 respectively. While some phage-associated genes are scattered throughout the chromosomes, the majority are concentrated in well-defined prophage regions. Analyses of the prophages suggest that many may still be functional. These prophages often carry at their extremity some unrelated cargo genes that probably arose from genomes of previously infected bacteria, as found in Salmonella [33]. We sequenced a total of 6 plasmids, varying in size from 34 to 134 kbp: four strains possess one plasmid each whereas one strain has 2 plasmids (Table 3). As frequently noted, the plasmids have a lower gene density (84%, vs. 87% for chromosomes), lower G+C content (47.4%, vs. 50.7% for chromosomes) and more pseudogenes (2.7%, vs. 1.5% for chromosomes). The percentage of orphan proteins (i.e., having no detectable homolog in other organisms) is also high on plasmids (6.5 to 52.2%), while it ranges between 1–3% on the chromosomes. A manual expert annotation of the new E. coli strains was performed on genes and regions not found in E. coli K-12 MG1655 (about 10 000 genes in total; Table S2A). This allowed the re-annotation of orthologs in the previously available Escherichia and Shigella genomes (see Materials and Methods). The annotation data, together with the results of the comparative analysis were stored in a relational database called ColiScope, which is publicly available using the MaGe Web-based interface at http://www.genoscope.cns.fr/agc/mage. This re-annotation process revealed extensive variations in the number of the newly predicted genes (Table S2B). For example, between the two strains of E. coli O157:H7 we found twice as many newly predicted genes in one strain as in the other. In some genomes important genes were missing. For example, in E. coli APEC O1 several subunits of the ribosome, DNA polymerase III, and ATP synthase were missing in the original annotation (Table S3, E. coli APEC sheet). In other genomes, the re-annotation allowed us to standardise the definition and identification of pseudogenes. For example, in S. sonnei Ss 046 most of the newly annotated genes correspond to insertion sequences (ISs) and small fragments of incompletely annotated pseudogenes (Table S3, S. sonnei sheet). As a result of this effort, the present ColiScope database contains a complete and consistent set of annotations for the 7 newly sequenced genomes and the 14 available Escherichia and Shigella genomes. These data were the starting point of the work presented here. We analyzed gene order conservation within the 21 genomes (Table S4). More than half of the genomes have exactly the gene order of E. coli K-12 MG1655, which we inferred as ancestral. Thus, the organisation of the core genome is stable in most strains. Three genomes show 1 or 2 rearrangements. Seven genomes show more than 10 blocks of synteny: 6 of these genomes are from Shigella, the high rearrangement rates of which resulted in up to 65 blocks of synteny in S. dysenteriae. These genomes have a large number of ISs, ranging from 549 to 1155 in S. flexneri and S. dysenteriae, respectively, which are well known to shuffle genomes. E. fergusonii also shows a large number of rearrangements relative to the ancestral organization of the E. coli genome. Since the organisation of some strains of the more distantly related Salmonella enterica closely resembles that of E. coli K-12 MG1655, many rearrangements must have taken place in the branch leading to E. fergusonii. Figure S1 provides the classical concentric circle representation for the 7 genomes we sequenced, showing GC skews, G+C variation, and a description of the presence of genes in ever-increasing clades within the genus, relative to the inferred ancestral genome. The first position of the sequences was chosen to match the orthologous region in the E. coli K-12 MG1655 genome and corresponds to the intergenic region between lasT and thrL. Origins and termini of replication were identified by GC skews and homology with the respective E. coli K-12 MG1655 regions. These figures show that divergence from the average G+C content often occurs in genomic regions absent in the other strains. They also reveal the highly mosaic structure of these genomes, comprising the core genes and the accessory genes, which we then set out to quantify. The Core and Pan-Genomes of E. coli The analysis of the first E. coli genomes changed our views about the evolution of gene repertoires in bacteria. Genomes within the species vary in size by more than 1 Mb, i.e., by more than 1000 genes, and even the gene repertoires of similarly sized genomes differ widely [30],[34]. We have thus taken advantage of the unprecedented availability of 20 completely sequenced genomes of the same species to analyse the evolution of the gene repertoire. We first identified the core and pan-genomes of E. coli, i.e., the genes present in all genomes and the full set of non-orthologous genes among all genomes. In our data set, the average E. coli genome contains 4721 genes, the core genome contains 1976 genes, and the pan-genome contains 17 838 genes. The random sampling of one gene within a randomly selected E. coli genome has a probability of only ∼42% of revealing a ubiquitous gene. On the other hand, the full sequencing of an E. coli strain allows observation of only one-fourth of the observed pan-genome. This implies that although some fundamental functions can be well studied by using a model strain, no single strain can be regarded as highly representative of the species. Further sampling of E. coli genomes is unlikely to change significantly the estimate of the core genome, however, the pan-genome is far from being fully uncovered (Figure 1). Annotation and sequencing artefacts may affect the estimations of core and pan-genome sizes, e.g. by spurious annotation of small genes or pseudogenes. We hope to have minimised such problems by using a coherent set of annotations. Still, we found that 40 genes deemed essential in E. coli K-12 W3110 [35] were missing in the core genome. Among these, 17 correspond to genes with conflicting reports of essentiality, or contextually essential genes such as prophage repressors, and are absent in most genomes. The other 23 genes have orthologs in most genomes and 19 are missing in a single genome where they can be found as pseudogenes interrupted by a single-nucleotide frameshift. While “pseudogenisation” does often start with such frameshifts [36], these genes correspond to core housekeeping functions, so the reported frameshifts probably represent sequencing errors. For example, it is hard to see how S. boydii could replicate without the catalytic α-subunit of the DNA polymerase III or how E. coli 536 could survive without a tyrosine tRNA synthetase. We found some comfort in verifying that none of the 23 genes was absent from the 7 genomes we sequenced. If one assumes that these essential genes cannot be deleted and that no special care has been taken to check for sequencing errors at these loci, then our estimation of the core genome should be increased by a factor of 260/(260-23) to 2167 genes. This still makes the core genome less than half of the average E. coli genome (∼46%). Importantly, no gene of the core genome, nor any operon ubiquitous in E. coli, was unique to the species, i.e., we could always find a homolog in at least one of the other fully sequenced bacterial genomes. 10.1371/journal.pgen.1000344.g001 Figure 1 Escherichia coli core and pan-genome evolution according to the number of sequenced genomes. Number of genes in common (left) and total number of non-orthologous genes (right) for a given number of genomes analysed for the different strains of E. coli. The upper and lower edges of the boxes indicate the first quartile (25th percentile of the data) and third quartile (75th percentile), respectively, of 1000 random different input orders of the genomes. The central horizontal line indicates the sample median (50th percentile). The central vertical lines extend from each box as far as the data extend, to a distance of at most 1.5 interquartile ranges (i.e., the distance between the first and third quartile values). At 20 sequenced genomes, the core-genome had 1976 genes (11% of the pan-genome), whereas the pan-genome had (i) 17 838 total genes (black), (ii) 11 432 genes (red) with no strong relation of homology ( 0.05), suggesting that our inference is unbiased with respect to successive events taking place at the same locations in long branches or by selection-purging older events in internal branches. Variation in gene repertoires has been described as being scattered on the chromosome of E. coli and balanced between the two replichores [61]. For the numerous small insertions and deletions this distribution results naturally from random insertion/deletion of genetic material. Such small indels are expected to have little impact on the large-scale organisation of the genome. What about the very large insertions/deletions? The 554 such events that involve more than 10 genes over-represent insertions over deletions (Fisher exact test, p 0.05). This shows that insertion/deletion hotspots are also rearrangement hotspots, even though we initially removed rearranged positions to identify the insertion/deletion hotspots (thus being conservative). It also suggests that rearrangements occur in these regions because they are permissive to change not because they are intrinsically recombinogenic, since the frequency with which they rearrange simply reflects their larger size. However, even if hotspots are not intrinsically recombinogenic they can still be caused by the brokering effect of homologous recombination. Indeed, incoming DNA once integrated in one genome can propagate within the population by lateral transfer via classical homologous recombination involving the homologous flanking regions. Given the observed rates of recombination in the species, this mechanism could quickly lead to the horizontal spread of highly adaptive newly acquired genes. We describe some evidence for this in the next section. Hotspots of Phylogenetic Incongruence For any given sequence alignment, the likelihood of the overall gene tree topology, i.e., the phylogenetic congruence, reflects the extent to which the phylogenetic signal of the sequences was altered by recombination. While the concatenate of genes provides a strong phylogenetic signal, the individual genes' histories can be very diverse as a result of recombination. Furthermore, these histories may depend on the genes' positioning in the chromosome. Notably, if homologous recombination helps in disseminating recent acquisitions, as we propose, the core genome around these hotspots should show signs of recombination as indicated by phylogenetic incongruence. We therefore made an analysis in 5 kbp sliding windows along the multiple genome alignment to identify the most phylogenetically incongruent regions (see Material and Methods). This method identified two large regions of very strong incongruence, one centred around rfb (Figure S8), the operon involved in O antigen synthesis, and the other around the leuX tRNA gene, and including fimA, which is under diversifying selection and is involved in the adhesion of bacteria to host cells [66]. Both loci were previously identified as hotspots of phylogenetic incongruence [67],[68]; the present analysis reveals how much they affect the chromosome. Recombination at the rfb locus significantly affects congruence within a striking 150 kbp surrounding region, i.e., from positions 1988 kbp to 2138 kbp (100% of windows tested had scores lower than 1.96 standard deviation away from the average, with an average of −4.84 and peaks at −10.19). The fim locus includes an incongruence region close to 200 kbp in length (from positions 4421 kbp to 4618 kbp, average −2.54 standard deviation and 73% with lower than −1.96 standard deviation and peaks at −6.65). Interestingly those two regions are centered on integration hotspots and encompass 11 of the 133 hotspots of integration. The genes present in such loci arose most likely by lateral transfer since they are highly dissimilar between strains. For example, genes at the rfb locus genes can exhibit less than 50% similarity, while the leuX locus encompasses a highly variable assortment of non-homologous inserts in all the genomes sequenced. Hence at least for those two major loci we find a striking link between hotspots of integration and hotspots of homologous recombination. In the case of the rfb locus, it is worth noting that the incongruence signal we observe might be a composite signal, due not only to rfb but also to neighbouring loci. Within the above defined rfb region of incongruence, a flagella locus (fli operon) associated with two hotspots of integration is also under diversifying selection. Moreover, the high pathogenicity island (HPI) is integrated within that high recombination region in many isolates and corresponds also to a hotspot of integration. It has been suggested that after a recent and unique integration event, the HPI has propagated within the species by homologous recombination [69]. The propagation or diversification of these loci, located to the left of rfb, through homologous recombination might generate the asymmetrical pattern of phylogenetic incongruence we observe around the rfb locus (extended incongruence on the left side of the rfb locus) (Figure S8). We found 23 other regions with weaker signatures of incongruence (i.e., with a 5 kbp sequence incongruence score more than 2 standard deviations from the average), each spanning less than 20 kbp. It is important to note that most of these incongruent regions include genes involved in diversification of genetic information and often pathogenicity. The vast majority of these include 3 groups of common genes. First are regions with the porin-encoding genes ompA and ompC, the flagella-encoding genes, the rfa locus coding for the core lipopolysaccaride and genes coding for several membrane proteins such as LolCDE, CcmABCDE, ABC transporter, AroP APC transporter, LplT-aas, FadK, YeaY, EamB, YhgE and YicG membrane proteins. These loci are probably involved in diversifying selection since they code for antigenic proteins exposed at the cell surface. Second are two regions encompassing mismatch repair genes (mutS and mutH) that have been shown to be under selection for cycles of inactivation and reacquisition through recombination [70]. Third is a region associated with the integration of a locus that can provide resistance to phages through clustered regulatory interspaced short palindromic repeats (CRISPRs) [71]. All available methods estimate the effective, not the intrinsic, recombination rate. Effective recombination results from the intrinsic recombination rate and the ensuing selection on recombinants. Most of the phylogenetic incongruence hotspots we found contain genes under diversifying selection, for instance to escape immune pressure or to acquire resistance to phage. Hence, it is very likely that differences in the intensity of selection might be responsible for the differences observed in the size of the regions affected by a phylogenetic incongruence hotspot. A recombinant carrying a new allele at a locus under strong diversifying selection will be selected and thus will increase rapidly in frequency in populations. Hence, the recombinant will invade the local population before any further recombination occurs at the locus [72]. In that case, sampling the genome after the action of natural selection allows identification of the original recombining fragment. In contrast, if selection is moderate, the recombining fragment that brought the interesting allele into the genome will be covered by many further recombination events before it reaches high frequency. In this case, only fragments around the selected allele will retain the trace of the recombination event. As a consequence, when selection is intense, one expects to identify long recombinant fragments in some strains, as we did at the rfb or leuX loci. Our observations suggest that the intensity of diversifying selection acting on the rfb and leuX-fimH loci are under very strong selective pressure compared with the diversifying selection acting on the core LPS, the flagella or some of the porins. The fact that most hotspots of integration (117 among 133) do not result in hotspots of phylogenetic incongruence suggests that they carry neutral or deleterious genes. Conversely, it also suggests that some horizontally acquired genes can be highly beneficial (e.g., 11 hotspots of phylogenetic incongruence around the rfb or leuX-fimH locus) or moderately beneficial (e.g., 4 hotspots of integration associated with hotspots of phylogenetic incongruence) and that this results in different selection footprints in the neighbouring core genome. Recombination and Chromosome Organisation The existence of integration and phylogenetic incongruence hotspots brings to the fore the conflict between genome dynamics and organisation. We therefore analysed the variation in recombination along the backbone sequence (as estimated by a population genetic-based approach), using a sliding window of 3 kbp on the multiple genome alignment and a step size of 500 bp. This analysis revealed a large region around the terminus of replication with a particularly low ratio between gene conversion and mutation rates (Cgc/theta) (Figure 10). The region between 1 Mb and 2 Mb shows lower gene conversion rates, since there is 20% lower chance for a base to be involved in a gene conversion event (Cgc×Lgc, unilateral t test: p = 1e-21). This region also shows 10% lower levels of polymorphism (theta of Watterson, p = 1e-7), i.e., variations within the E. coli species, and 2% lower G+C content (Figure 10). A+T richness at the terminus region has been suggested to result from higher mutation rates [73]. Based on comparative genomics with Salmonella, it was also shown that divergence, i.e., the genetic distance between species, slightly increased closer to the terminus [74],[75], further supporting the hypothesis of a higher local mutation rate. Using our newly sequenced outgroup genome E. fergusonii, which unlike Salmonella does not shows saturation of synonymous substitutions, we found that the terminus domain has synonymous and non-synonymous substitution rates twice as high as the rest of the chromosome. While decreased G+C content and increased divergence could reflect a higher mutation rate at the terminus, such an interpretation is contradicted by the observed lower polymorphism. 10.1371/journal.pgen.1000344.g010 Figure 10 Standardized cumulative sum of effective gene conversion rate and G+C content. Gene conversion rate (i.e., probability of being involved in a gene conversion event Cgc.Lgc) is shown in blue, and G+C content in red. A decrease in the cumulative sum reflects regions of lower-than-expected values of the statistics. Around the terminus domain, we found a decrease in both recombination and G+C content. Coloured boxes represent the 4 different organisation macrodomains (Right, Ter, Left, Ori). The arrows point towards the origin and terminus of replication. Theoretical population genetic studies have shown that the fluctuation of recombination frequency along chromosomes affects the level of polymorphism and the efficiency of selection [76]. When there are numerous deleterious mutations and low recombination rates, a fraction of the population bearing deleterious alleles is doomed to disappear in the long term without contributing to the gene pool of the future population. The relevance of this phenomenon, referred to as background selection, requires the existence of deleterious mutations of moderate effects, i.e., mutations that can persist for some time in the population before selection wipes them out. At the population level, this result in an excess of rare alleles, which can be estimated by Tajima's D statistics. We found that overall a gene's average Tajima's D was slightly negative (indicating an excess of rare alleles). However, the Tajima's D of synonymous mutations was null, while that of non-synonymous mutations was much more negative (Figure S9). This suggests that most non-synonymous mutations are deleterious since, in contrast to synonymous mutations, they do not increase in frequency within the population reflecting the purging effect of natural selection. Therefore the conditions for the action of background selection are met. Furthermore, under background selection, a reduced recombination rate results in a decreased polymorphism (such as we observed around the terminus), an increased fraction of rare alleles and a decreased efficiency of selection [76]. The terminus region shows a lower Tajima's D than the rest of the chromosome (Student bilateral test, p 8 mice killed) [32]. Supporting Information Figure S1 Circular representation of the six Escherichia coli genomes and the E. fergusonii genome. Circles display from the inside out: (1) GC skew (G+C/G−C using a 1 kbp sliding window). (2) Location of tRNA genes, rRNA operons and Insertion Sequences (ISs). (3) GC deviation (difference between the mean GC content in a 1 kbp window and the overall mean GC). Red areas indicate that the deviation is greater than 2 standard deviations. (4) Ancestral E. coli genome. Yellow areas denote genes that are present in all the genomes under study. (5) Scale. (6) Gene specificity at strain level. Genes sharing at least one homolog in an other E. coli strain of the same phylogenetic group and having more than 85% identity over at least 80% of their length were regarded as non specific. To simplify the visualisation of specific regions, we created a colour gradient that denotes the percentage of organisms that possess a homolog of a given gene within the reference genome. If this particular gene is present in all the organisms under study, it is tagged in light grey. Conversely, if it is present only in the reference genome, it is tagged in dark colour. In other words, the more pronounced the colour, the higher the specificity. (7) Gene specificity at group level. The same criteria were used as for circle (6) but the genome analysed is compared to E. coli strains that belong to other phylogenetic groups. The comparison includes Shigella as well. (8) Gene specificity at the species level. The same protocol was used as for circles (6) and (7) except the comparison involves E. fergusonii which is considered as the outgroup for this study. (1.82 MB PPT) Click here for additional data file. Figure S2 Visual representation of MAUVE multiple alignment of 20 Escherichia coli/Shigella genomes. The representation was performed using the MOSAIC database (http://genome.jouy.inra.fr/mosaic/) multiple alignment viewer. Horizontal lines correspond to a linear representation of each genome sequence drawn to scale. The blue line corresponds to annotated genes. (At this scale only a unique line is visible.) Coloured blocks correspond to the locally collinear blocks (LCBs) of the alignment as defined by MAUVE. LCBs corresponding to inversions are represented on a second line. An LCB in one genome is linked to the corresponding LCB on the subsequent genome with a plot of the same colour. This visual representation shows that, apart from the rearrangements present in Shigella chromosomes, E. coli genomes are mostly collinear. (0.11 MB PPT) Click here for additional data file. Figure S3 Phylogenetic tree of the backbone of the 20 Escherichia coli and Shigella strains as reconstructed by MAUVE software. This unrooted tree was built using Tree-puzzle with the HKY+gamma (with 8 categories)+I model followed by BioNJ to reconstruct the tree from the distance matrix. The values at the nodes correspond to support values for each internal branch, as estimated by Tree-puzzle (range 0–100), and can be interpreted in much the same way as bootstrap values. (0.09 MB PPT) Click here for additional data file. Figure S4 Association between gene repertoire relatedness and phylogenetic distance. A. Genomes were binned according to phylogenetic distance for clarity. For the first two bins, which correspond to the most closely related genes, there is a high percentage of genome in common, which is not the case for the other bins, which correspond with more distantly related genes. B. Histogram of the phylogenetic distances between pairs of genomes. (0.04 MB PPT) Click here for additional data file. Figure S5 Reconstruction of gains and losses of genes in the evolution of Escherichia coli. The cladogram shows the phylogenetic relationships between the 20 E. coli/Shigella genomes rooted on the E. fergusonii genome, as in Figure 4, with branch lengths ignored for clarity. Each strain and internal node of the tree is labelled with the inferred numbers of genes gained (red: top) and lost (black: top), and the inferred numbers of corresponding events of gene acquisition (red: bottom) and loss (black: bottom) along the branch. Pie charts on each branch indicate the functional classification of genes lost, using the colour-scale (details in the keys). The functional classes of known-function genes are represented by numbers explained by a key in Supplementary Table 4. (0.31 MB PPT) Click here for additional data file. Figure S6 Association between the distance of a node to the tip of the tree and the difference between the predicted ancestral genome size and the effective number of genes reliably predicted to be present in the node. The association is highly significant (R2 = 0.56, p<0.001). (0.03 MB PPT) Click here for additional data file. Figure S7 Characteristics of hotspots of insertion/deletion of genetic material. The circles indicate the values per location between contiguous genes in the core genome. Data are (from the outside circle inwards): average number of genes, standard deviation, sum of genes, number of prophage-like elements, number of insertion sequence like elements, sum of tRNA genes and heterogeneity rate at hotspots. The latter is the ratio between the observed number of orthologs and the expected value if all genes had orthologs in all genomes, after excluding genomes lacking genes at the hotspot and those for which the region has a synteny breakpoint. (0.14 MB PPT) Click here for additional data file. Figure S8 Phylogenetic congruence at the rfb locus coding for O antigen. We followed the likelihood of the species topology for 5 kbp windows (spaced by 250 bp) along the chromosomal backbone. After correcting for the number of polymorphic sites, each window received a Z score of phylogenetic congruence. Low values reflect lower than average phylogenetic congruence. A large region (green arrow) has a significantly lower congruence than the rest of the genome. The red arrows indicate the hotspots of integration and the corresponding loci when identified. HPI: high pathogenicity island. (0.03 MB PPT) Click here for additional data file. Figure S9 Distribution of Tajima's D statistics on the 1976 Escherichia coli core genome genes. The colour code is as follows: all mutations (red), synonymous mutations (green) and non-synonymous mutations (yellow). Negative Tajima's D values [126] reflect a higher than expected frequency of rare alleles. The more negative value of Tajima's D for non-synonymous mutations suggests that they are on average deleterious: they persist some time in populations before selection removes them. (0.08 MB PPT) Click here for additional data file. Table S1 Pseudogenes found in Escherichia coli K-12 MG1655 and the 7 newly sequenced genomes of the ColiScope project. The ‘Reference’ column give the coding sequence (CDS) identifier of the wild type form of the gene, in one of the 21 analyzed Escherichia and Shigella bacterial genomes. For each of these 21 genomes the particular gene's status is indicated as functional (‘1’), absent (‘0’) or a pseudogene (‘−1’). Gene names in boldface correspond to genes that are pseudogenes only in the considered strain. (0.25 MB XLS) Click here for additional data file. Table S2 A) Number of predicted protein encoding genes in the genomes of the newly sequenced strains of Escherichia coli and E. fergusonii. Genes were (a) functionality annotated using automatic annotation transfer from K-12 MG1655 orthologs or other ColiScope manually annotated orthologous genes, (b) manually annotated using the MaGe web-based graphical interface, or (c) considered as false positive gene predictions. B) Publicly available Escherichia and Shigella genomes included in the ColiScope database. (a) Inaccurate (‘Wrong’ status) or missed gene annotations (‘New’ status) have been found using our MICheck procedure. For the 14 analyzed genomes, the list of newly predicted genes is given in Supplementary Table 3. (b) Automatic functional annotation transfer between orthologous genes (85% identity over at least 80% of the length of the smallest protein) began with similarity results obtained with E. coli K-12 MG1655, then with the new genomes of the ColiScope project. False gene predictions (i.e., artefacts) were those defined in the course of the expert annotation of the ColiScope sequences. (c) ‘Specific genes’ are genes that have no ortholog in E. coli K-12 MG1655 or any of the newly sequenced and annotated genomes. (0.05 MB DOC) Click here for additional data file. Table S3 Missing genes in publicly available Escherichia coli and Shigella genomes. The genes are ordered by length (given in base pairs). Those that are similar to genes from the minimal gene set defined by [127] are highlighted in boldface. Functional descriptions of the genes, starting with ‘fragment of’ (Product column), are provided for putative pseudogenes (whether actual pseudogenes or sequencing errors). For some E. coli stains, e.g., UTI89, the corresponding pseudogenes were probably correctly annotated by the authors (numbering of the gene locus_tags), but were not reported in the databank files (GeneBank and EMsBL), and thus were annotated as missing genes by the MICheck procedure [97]. (0.36 MB XLS) Click here for additional data file. Table S4 Synteny blocks and insertion sequence (IS) elements among the 21 Escherichia coli/Shigella/E. fergusonii genomes. (0.03 MB DOC) Click here for additional data file. Table S5 Classification of bioprocesses (key for Figure 7). The tests indicate the sense of the difference in the number of genes associated with a given bioprocess. ‘+/−’ means more/fewer genes in the first class, i.e., more/fewer in the core genome than in the complementary set. ‘++/−−’ means the difference is significant at the 5% level, using a Chi square test followed by a sequential Bonferroni correction for multiple tests. (0.02 MB DOC) Click here for additional data file. Table S6 Genes (and associated characteristics) categorically associated with certain phylogenetic groups or pathotypes. Principal characteristics of the genes were deduced from the annotation process. (0.23 MB DOC) Click here for additional data file. Table S7 Genes of the genomic island at the pheV tRNA insertion hot spot (see Figure 9). (0.08 MB XLS) Click here for additional data file.

0 comments Cited 456 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: not found

GeneMark.hmm: new solutions for gene finding.

A. Lukashin (1998)

The number of completely sequenced bacterial genomes has been growing fast. There are computer methods available for finding genes but yet there is a need for more accurate algorithms. The GeneMark. hmm algorithm presented here was designed to improve the gene prediction quality in terms of finding exact gene boundaries. The idea was to embed the GeneMark models into naturally derived hidden Markov model framework with gene boundaries modeled as transitions between hidden states. We also used the specially derived ribosome binding site pattern to refine predictions of translation initiation codons. The algorithm was evaluated on several test sets including 10 complete bacterial genomes. It was shown that the new algorithm is significantly more accurate than GeneMark in exact gene prediction. Interestingly, the high gene finding accuracy was observed even in the case when Markov models of order zero, one and two were used. We present the analysis of false positive and false negative predictions with the caution that these categories are not precisely defined if the public database annotation is used as a control.

0 comments Cited 436 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: not found

Toward an online repository of Standard Operating Procedures (SOPs) for (meta)genomic annotation.

Samuel V. Angiuoli, Aaron Gussman, William Klimke … (2008)

The methodologies used to generate genome and metagenome annotations are diverse and vary between groups and laboratories. Descriptions of the annotation process are helpful in interpreting genome annotation data. Some groups have produced Standard Operating Procedures (SOPs) that describe the annotation process, but standards are lacking for structure and content of these descriptions. In addition, there is no central repository to store and disseminate procedures and protocols for genome annotation. We highlight the importance of SOPs for genome annotation and endorse an online repository of SOPs.

0 comments Cited 410 times – based on 0 reviews      Review now

Bookmark

All references

Author and article information

Journal

Journal ID (nlm-ta): BMC Bioinformatics

Title: BMC Bioinformatics

Publisher: BioMed Central

ISSN (Electronic): 1471-2105

Publication date Collection: 2011

Publication date (Electronic): 30 June 2011

Volume: 12

Page: 272

Affiliations

[1 ]Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA

[2 ]Institute for Genome Sciences (IGS), University of Maryland Baltimore, Baltimore, Maryland 21201, USA

Article

Publisher ID: 1471-2105-12-272

DOI: 10.1186/1471-2105-12-272

PMC ID: 3142524

PubMed ID: 21718539

SO-VID: 30e47b46-3164-444b-98b5-4708791b9819

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

History

Date received : 14 March 2011

Date accepted : 30 June 2011

Comments

Comment on this article

scite_

Cited by 22

See all cited by

Improving pan-genome annotation using whole genome multiple alignment

Read this article at

Abstract

Background

Results

Conclusions

Related collections

Genome Engineering using CRISPR

Most cited references 22

Organised Genome Dynamics in the Escherichia coli Species Results in Highly Diverse Adaptive Paths

GeneMark.hmm: new solutions for gene finding.

Toward an online repository of Standard Operating Procedures (SOPs) for (meta)genomic annotation.

Author and article information

Journal

Affiliations

Article

History

Categories

Comments

Comment on this article

Similar content 289

Cited by 22