Introduction The enterococci are a diverse group of Gram-positive gastrointestinal (GI) tract colonizers with lifestyles ranging from intestinal symbiont to environmental persister to multidrug-resistant nosocomial pathogen (1, 2, 3). Enterococci are used in food production, in probiotic products, and for tracking fecal contamination and thus also are of regulatory and industrial interest. Most enterococcal research has focused on the two species most associated with human GI tract colonization and infection, Enterococcus faecium and Enterococcus faecalis (2, 3). Certain lineages, defined by multilocus sequence typing (MLST), are associated with hospital-acquired infections (e.g., E. faecium sequence type 17 [ST17] ST18, ST78, and ST203 and E. faecalis ST6, ST9, and ST40) (4). Genome analysis has illuminated the extent of mobile content (5) and evolution of antibiotic resistance (6) in E. faecalis ST6 strain V583 and the mobile element content and metabolic capabilities of E. faecium (7). Using genomic data we recently developed for 28 enterococcal strains (8), we report and quantify divergence within what is currently classified as E. faecium and E. faecalis and identify the genetic bases for the defining characteristics of the motile enterococcal species Enterococcus gallinarum and Enterococcus casseliflavus. We identify loci homologous to those known to direct the synthesis of extracellular polymers that interact with host surfaces, including a putative E. faecium capsule locus. We additionally identify genetic sequences and biochemical functions that represent distinguishing features of potential value for the rapid identification of enterococci to the species level. RESULTS AND DISCUSSION Phylogenetic analysis of enterococci. We recently announced the public release of genome sequence data for 28 enterococcal strains of diverse origin (8) (see Table S1 in the supplemental material). The 16 E. faecalis genomes sequenced represent the deepest nodes in the MLST phylogeny, providing the greatest diversity. The strains include those of clinical, animal, and insect origins and were isolated from 1926 to 2005 (9). These strains represent approximately 80 years of enterococcal evolution, spanning the periods prior to and during widespread antibiotic use. Additionally, the genomes of 6 E. faecium, 1 E. gallinarum, and 3 E. casseliflavus clinical isolates from 2001 to 2005 and 2 human fecal E. faecium strains were examined. OrthoMCL (10) was used to identify ortholog groups in the 30 enterococcal genomes. Ortholog groups represented in all 31 genomes were considered core groups, which were further subdivided into single-copy (1 gene copy in each genome) and multicopy (>1 gene copy exists in at least 1 genome). Genes not clustered were considered orphans. A phylogenetic tree generated from the concatenated sequences of 847 single-copy core genes is shown in Fig. 1. Relationships among the 18 E. faecalis strains, despite their diverse origins, cannot be fully resolved by this analysis (based on lack of bootstrap support for branches within the E. faecalis branch; inset, Fig. 1). As expected, E. casseliflavus and E. gallinarum branch separately, supporting their designation as different species. Importantly, two clades were identified within the species E. faecium, as had been inferred by comparative genome hybridization, which suggested that hospital-associated isolates, including ST17 and ST18 isolates, may make up a distinct subspecies within E. faecium (11). The 3 vancomycin-resistant E. faecium strains in our collection are members of clade A, while the 2 human fecal isolates are members of clade B (Fig. 1). To quantify the relationships among these strains, we generated average nucleotide identity (ANI) plots (Fig. 2), which have been used to query and refine prokaryotic species definitions (12, 13). FIG 1 Core gene tree. Concatenated sequences of 847 genes core to 30 enterococci and the outgroup species L. lactis were aligned, and a phylogenetic tree was generated using RAxML with bootstrapping. The bootstrap value for all nodes outside the E. faecalis clade is 100. E. faecium clades A (blue) and B (red) are indicated. FIG 2 ANI plot. Each point represents a pairwise comparison of two genomes. Grey diamonds, E. faecalis-E. faecalis comparisons. Blue circles, clade A E. faecium-clade A E. faecium comparisons. Red circles, clade B E. faecium-clade B E. faecium comparisons. Yellow circles, clade A E. faecium-clade B E. faecium comparisons. A species threshold of 94 to 95% ANI is indicated by the green-shaded area. E. faecium. The E. faecium ANI analysis refines phylogenetic relationships among clade A and clade B strains (Fig. 2). Within clade A, ST17 strain 410 and double-locus variants (DLVs) 933 (ST18) and 502 (ST203) are closely related (99.2 to 99.4% ANI) whereas strains 501 (ST52) and 408 (ST582, an ST17 DLV) have lower ANI values with those strains, and each other (96.9 to 98.2% ANI). Similar ANI values were observed among clade B strains (97.9 to 99.4%). However, pairwise comparisons of clade A and clade B strains ranged from 93.9 to 95.6% ANI, overlapping an ANI species line of 94 to 95%. ANI values of 94 to 95% correlate with experimentally derived 70% DNA-DNA hybridization values, a commonly accepted threshold for species designation (12, 13, 14). Clade A and clade B may be endogenous to the GI tracts of different hosts and now coexist among human flora as a result of antibiotic elimination of competitors, or clade A and clade B may be diverging from each other as a result of antibiotic use and ecological isolation (less likely because of the short time involved). For the 8 E. faecium strains in our collection, the two clades are recapitulated using the 7 housekeeping genes selected for E. faecium MLST (see Fig. S1 in the supplemental material). Between clade A and clade B strains, the nucleotide identities of concatenated MLST sequences range from 96.2 to 96.9% (compared to a 93.9 to 95.6% ANI range). To determine whether a single marker is representative of either E. faecium clade, we examined the distribution of individual MLST alleles among the E. faecium STs assigned to clade A or clade B (see Fig. S1 in the supplemental material). A “minority allelic population” composed of 5 divergent STs was reported in seminal E. faecium MLST work (15). The 5 divergent STs (ST39, ST40, ST60, ST61, and ST62) identified by that study belong to clade B (see Fig. S1 in the supplemental material). The genomes of 7 additional E. faecium strains were recently sequenced (7), and we used MLST to assign them to clade A or B (see Fig. S1 in the supplemental material). The assignment of one of these strains, E. faecium E980, to clade B is consistent with previous analyses demonstrating the phylogenetic distance of this strain from the other 6 (clade A) strains in that sequencing collection (7). In the first-pass analysis, the allele adk-6, which differs from the ST17 allele adk-1 at 3 synonymous sites, was observed to occur almost exclusively in clade B strains (see Fig. S1 in the supplemental material). To further explore the distribution of adk alleles among E. faecium isolates, we extracted sequences of all 617 available STs in the E. faecium MLST database and determined the extents of identity to an ST17 (clade A) reference. In the MLST database, adk-1, adk-5, and adk-6 are the most abundant adk alleles, representing 87% of the E. faecium STs. Of the 85 STs possessing adk-6, 66 (78%) share 96 to 96.9% nucleotide identity with ST17, comparable to that observed for clade B-ST17 comparisons. Conversely, adk-1 and adk-5 occur primarily in STs with ≥99% identity to ST17. These data suggest that adk allele exchange is restricted, perhaps resulting from a barrier to DNA uptake such as clustered regularly interspaced short palindromic repeats (CRISPR)-cas defense and/or from the proximity of adk to the replication origin (Fig. 3). FIG 3 E. faecium genome mosaicism plot. The outermost ring shows E. faecium Com12 scaffolds, ordered by decreasing length clockwise from scaffold 1, with each gene represented as a radial position along the ring. Each of the remaining 7 E. faecium genomes is represented by the rings below Com12. Genes are colored by membership in clade A (blue) or clade B (red), as determined by individual gene trees built from ortholog groups. The strains shown, from the outermost to the innermost rings, are Com12, 733, Com15, 501, 408, 502, 933, and 410. The locations of dnaA, Com12 MLST alleles, pbp5, and the EfmCRISPR1-cas locus are shown. E. faecium 408 is a DLV of ST17 that possesses adk-6 and ddl-13 (see Fig. S1 in the supplemental material). Because adk-6 occurs mostly among strains with lower relatedness to ST17, and ddl-13 is present in two clade B strains (see Fig. S1 in the supplemental material), we were curious about whether these alleles were acquired by recombination. Genome mosaicism is evident in E. faecium clade A strains 408 and 501 (Fig. 3). The occurrence of adk-6 and ddl-13 within a hybrid region in E. faecium 408 (Fig. 3; see data set S1 in the supplemental material) supports the acquisition of this region from a clade B strain. The putative genome defense system EfmCRISPR1-cas (16), present in 2 of 3 clade B strains and in E. faecium 408 (see Table S1 in the supplemental material), occurs within this region, suggesting that CRISPR-cas was acquired by E. faecium 408 from a clade B strain via recombination. The hybrid region in E. faecium 501 includes pbp5 (Fig. 3; see data set S1 in the supplemental material), which can confer ampicillin resistance. Our results indicate that pbp5-S was acquired by E. faecium 501 from a clade B strain. The hybrid region in 501 is flanked by a putative phage integrase (EFRG_00906) that is conserved among all of the E. faecium strains in our collection (see data set S1 in the supplemental material). We recently reported an Hfr-like mechanism for the transfer of chromosomal genes between E. faecalis strains (17), and it seems likely that a similar mechanism functions in E. faecium. To determine whether specific traits define the two E. faecium clades, we searched for clade-specific ortholog groups present in and exclusive to all of the members of each clade. We then used representative gene sequences from each to search for similar sequences in 7 additional E. faecium genomes (7) assigned to clade A or B (see Fig. S1 in the supplemental material). Of the clade A-specific genes (see data set S2 and Table S2 in the supplemental material), 8 are associated with a locus that has high sequence identity with and almost the same gene content as the ycjMNOPQRSTUV locus of Escherichia coli, which is significantly enriched in enteric clades (18) and also occurs in Listeria (19). The organization of this locus is similar to that of a Lactobacillus acidophilus fructooligosaccharide (prebiotic) utilization locus (20). Of the genes unambiguously assigned to clade B (see data set S2 and Table S2 in the supplemental material), 5 encode putative transcriptional regulators with protein domain hits to Mga or Rgg, regulators of virulence, competence, and cell-cell signaling in streptococci (21, 22). Two of these putative regulators are divergently transcribed from genes that are also clade B specific, including a putative thioredoxin that could modulate the redox state of cellular targets in response to oxidative stress (23). A putative phospholipase C is also clade B specific. Finally, one clade B-specific gene (EFSG_01746) was useful in identifying a genomic insertion, composed of 17 genes, in E. faecium 733 (see Table S2 in the supplemental material). This region encodes a putative phosphotransferase system and a secreted hyaluronidase that could cleave the extracellular matrix of host cells. It is surprising that clade B (and not clade A, which contains all high-risk STs) strains encode a number of secreted factors that could interact with eukaryotic cell surfaces. This suggests that clade B strains may be more closely associated with host tissues in the GI tract than clade A strains are, which possibly contributes to their persistence in the GI tract, whereas clade A strains may be more transient and associated with the GI lumen, which contributes to their dissemination. E. faecalis. In contrast to E. faecium, little phylogenetic divergence was observed among E. faecalis strains (Fig. 2). Among 306 pairwise comparisons, ANI varies within a narrow range (97.8 to 99.5%). Instead, shared gene content among these strains varies (70.9 to 96.5%). For example, strain T11 shares 96.5% of its 2,511 genes with ST6 strain V583, while V583 shares only 72.8% of its 3,265 genes with T11; they possess 99.5% ANI in the genes that they share. The genome size of T11 is smaller than that of V583 (2.74 Mb versus 3.36 Mb) and is similar to that of the oral isolate OG1RF (24), likely representing the minimal E. faecalis genome. For all 18 E. faecalis strains, genome sizes vary between the extremes of T11 and V583 (see Table S1 in the supplemental material). We recently proposed that loss of CRISPR-cas in founders of modern E. faecalis high-risk MLST lineages facilitated the influx of acquired antibiotic resistance genes and other mobile traits into these lineages (16). Genome size distribution significantly differs between strains possessing or lacking CRISPR-cas (P = 0.026; one-tailed Wilcoxon rank-sum test), with a greater average genome size in strains lacking CRISPR-cas (3.1 Mbp versus 2.9 Mbp). The distribution of domain motifs associated with mobile elements is significantly different in strains with genomes >3 Mb in size (P 2 was considered a positive result. Each strain was tested twice, and the data shown are for both trials. Ratios of 3 Mb), substantial differences in gene content exist. Ecotypes defined by specific mobile element cohorts may be identified within high-risk lineages or in lineages with variable CRISPR-cas status (e.g., ST40 and ST21 ). Finally, comparative genomics highlighted fundamental differences between E. casseliflavus and E. gallinarum. The importance of the occurrence of motility operons in both but of genes related to the formation and function of the c-di-GMP second messenger only in E. casseliflavus and the impact of motility on metabolism represent interesting areas for future exploration. MATERIALS AND METHODS Enterococcal strains and genome sequencing. E. faecalis strains were selected for genome sequencing to represent the diversity of a collection of 106 isolates previously characterized (9). The E. faecalis V583 and OG1RF genome sequences were previously reported (5, 24). The E. casseliflavus, E. gallinarum, and 6 E. faecium strains were obtained from a repository of clinical isolates (Eurofins Medinet). E. faecium Com12 and Com15 were isolated from feces of healthy human volunteers under Schepens Eye Research Institute Institutional Review Board protocol 2006-02, Identification of Pathogenic Lineages of E. faecalis. E. faecium STs were previously determined (16, 55), and E. faecium MLST data were accessed at http://efaecium.mlst.net. The sequencing, assembly, annotation, and rapid public release of these genome sequences have been previously described (8). Standard analyses, OrthoMCL, and EnteroCyc. Orthologous gene groups were identified using OrthoMCL (10), with an all-versus-all BLAST cutoff of 1E−5. Lactococcus lactis subsp. cremoris SK11 plasmid (NC_008503 to NC_008507) and chromosomal (NC_008527) genes were included as the outgroup. Coding sequences were aligned using Muscle (56), and poorly conserved regions were trimmed using trimAI (57). All trimmed alignments were concatenated and used to estimate phylogeny using maximum likelihood and 1,000 bootstrap trials as implemented by RAxML (58) using the rapid-bootstrapping option and the GTRMIX model. Conserved protein domains were predicted using HMMER3 (59) to search the Pfam (release 24; http://pfam.janelia.org) (60) and TIGRfam (release 10) (61) databases. The statistical significance of differences in genome size and conserved protein domain distribution was assessed using the one-tailed Wilcoxon rank sum test. Membrane helix predictions were generated with transmembrane protein topology with a hidden Markov model (14). Protein subcellular localization predictions were generated using PsortB (62). Sequence alignments and phylogenetic trees in the figures in the supplemental material were generated with ClustalW in MacVector. Enzyme Commission (EC) numbers for the proteins in EnteroCyc (http://enterocyc.broadinstitute.org/) were predicted using gene coding sequences (CDS) and BLASTX to search the KEGG database (release 56) (63) and assigning EC numbers based on the KEGG annotation. Only significant hits with an E value of <1E−10 and 70% overlap were considered. Pathways, operons, transporters, and pathway holes were predicted using the Pathway Tools software suite (64, 65). Unless otherwise noted, BLASTP and nucleotide megaBLAST queries were executed against the NCBI nonredundant protein sequence, nucleotide collection, and whole-genome shotgun read databases using NCBI BLAST. Proteins encoded by the E. casseliflavus EC10 motility locus were compared to a B. subtilis 168 reference using BLASTP (see data set S3 in the supplemental material); the B. subtilis 168 flagellum is a reference Gram-positive flagellum in the KEGG database (http://www.genome.jp/kegg-bin/show_pathway?bsu02040). ANI and shared-gene analyses. OrthoMCL ortholog groups were used to determine shared gene contents in pairwise genome comparisons. For a genome pair (genome 1 and genome 2), the total number of genes in genome 1 was determined and the number of genes in genome 1 shared with genome 2 (based on shared ortholog group membership) was determined. Percent shared gene content was calculated by dividing the number of genome 1 genes shared with genome 2 by the number of genes in genome 1. Nucleotide alignments of shared genes were used to determine the numbers of identical and different nucleotide residues in shared genes. For comparisons within species, at least 2,113 gene sequences were utilized. Percent ANI was calculated by dividing the number of identical nucleotide residues in shared genes by the total number of nucleotide residues. Recombination analysis. See the Text S1 in the supplemental material for a description of the methods used for genome mosaicism analysis and plot generation. Biolog analysis. A subset of strains (8/18 E. faecalis, 6/8 E. faecium, 3/3 E. casseliflavus, and 1/1 E. gallinarum) representing the diversity of the collection were analyzed in duplicate by Biolog Phenotype microarrays in accordance with the manufacturer’s instructions. Optical density at 590 nm (OD590) was read using a synergy 2 microplate reader (Bio-Tek). The 48-h OD590 reading of each well containing a carbon source was divided by the OD590 value obtained for the negative-control well. A ratio which gave a reproducible value of 2× the background was considered to be a positive result. SUPPLEMENTAL MATERIAL Text S1 Supplemental methods. Expanded methods for genome mosaicism analysis and plot generation. Download Text S1, DOCX file, 0.1 MB. Text S1, DOCX file, 0.1 MB Data set S1 E. faecium 408 and E. faecium 501 mosaic genes. Download Data set S1, XLSX file, 0.1 MB. Data set S1, XLSX file, 0.1 MB Data set S2 E. faecium clade-specific ortholog group nucleotide BLAST analysis against 7 additional sequenced E. faecium isolates. Download Data set S2, XLSX file, 0.1 MB. Data set S2, XLSX file, 0.1 MB Data set S3 Motile enterococcus BLAST and Pfam analyses. Download Data set S3, XLSX file, 0.1 MB. Data set S3, XLSX file, 0.1 MB Data set S4 Extracellular polymer biosynthesis BLAST and Pfam analyses. Download Data set S4, XLSX file, 0.1 MB. Data set S4, XLSX file, 0.1 MB Figure S1 E. faecium MLST tree. Sequences were downloaded from the E. faecium MLST database and aligned using ClustalW in MacVector. A phylogenetic tree with bootstrapping (1,000 replications) was generated by the unweighted-pair group method using average linkages. For each ST, MLST allele profiles were extracted from the database and are shown on the right. adk-6 alleles, identified as being highly specific to clade B, with the exception of clade A strain 408, are in red. ST39, ST40, ST60, ST61, and ST62 are the minority allelic population identified in reference 15. Download Figure S1, PDF file, 0.1 MB. Figure S1, PDF file, 0.1 MB Figure S2 Alignment of CrtM proteins from S. aureus Newman, E. gallinarum, and E. casseliflavus. Substrate interaction residues are indicated by dots. Two DxxxD motifs (red boxes) and Mg2+ interact with the diphosphates of the farnesyl-diphosphate molecules and with inhibitors. Identical residues are shaded, and similar residues are shaded lightly. Substrate interaction data are from reference 36. Download Figure S2, PDF file, 0.1 MB. Figure S2, PDF file, 0.1 MB Figure S3 E. faecalis and E. faecium epa loci. The core epa genes and downstream variable regions are shown for 18 E. faecalis and 8 E. faecium strains. Conserved anchor genes flanking the epa core genes and the variable regions are indicated. Variable-region genes are colored by annotations and by BLASTP and Pfam conserved-domain hits as shown in data set S4. Multiple Pfam domains were collapsed into categories (for example, glycosyltransferases). Only the most abundant Pfam categories are shown. Orphan genes not grouped by OrthoMCL are indicated. Contig gaps in scaffolds are indicated by black bars; and the size of each black bar is proportional to the number of N’s inserted during genome assembly. In E. faecalis ATCC 4200, a scaffold gap occurs in the epa variable region (indicated by vertical slashes). The drawing is to scale, and a scale bar is shown. Download Figure S3, PDF file, 0.1 MB. Figure S3, PDF file, 0.1 MB Table S1 Bacterial strain information. Table S1, DOC file, 0.1 MB. Table S2 E. faecium clade-specific genes. Table S2, DOC file, 0.1 MB.