      Is Open Access

      Identification of Salmonella for public health surveillance using whole genome sequencing

      research-article


          Abstract

          In April 2015, Public Health England implemented whole genome sequencing (WGS) as a routine typing tool for public health surveillance of Salmonella, adopting a multilocus sequence typing (MLST) approach as a replacement for traditional serotyping. The WGS-derived sequence type (ST) was compared to the phenotypic serotype for 6,887 isolates of S. enterica subspecies I, and of these, 6,616 (96%) were concordant. Of the 4% of subspecies I isolates (n = 271) exhibiting a mismatch, 119 were due to a process error in the laboratory, 26 were likely caused by the serotype designation in the MLST database being incorrect and 126 occurred when two different serovars belonged to the same ST. The population structure of S. enterica subspecies II–IV differs markedly from that of subspecies I and, based on current data, defining the serovar from the clonal complex may be less appropriate for the classification of this group. Novel sequence types that were not present in the MLST database were identified in 8.6% of the total number of samples tested (including S. enterica subspecies I–IV and S. bongori) and these 654 isolates belonged to 326 novel STs. For S. enterica subspecies I, WGS MLST-derived serotyping is a high-throughput, accurate, robust, reliable typing method, well suited to routine public health surveillance. The combined output of ST and serovar supports the maintenance of traditional serovar nomenclature while providing additional insight on the true phylogenetic relationship between isolates.
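
          The counts quoted in the abstract can be cross-checked with a few lines of arithmetic. The following is a minimal Python sketch; the variable names are illustrative, and only the numbers are taken from the abstract.

```python
# Counts reported in the abstract for S. enterica subspecies I.
total_isolates = 6887
concordant = 6616

# Reported causes of mismatch between WGS-derived ST and phenotypic serotype.
mismatch_causes = {
    "laboratory process error": 119,
    "incorrect serovar designation in MLST database": 26,
    "two serovars sharing one ST": 126,
}

mismatched = total_isolates - concordant
concordance_pct = 100 * concordant / total_isolates

# The three reported causes should account for every mismatched isolate.
assert sum(mismatch_causes.values()) == mismatched

print(f"mismatched isolates: {mismatched}")
print(f"concordance: {concordance_pct:.1f}%")
for cause, n in mismatch_causes.items():
    print(f"{cause}: {n} ({100 * n / mismatched:.0f}% of mismatches)")
```

          The three mismatch categories sum exactly to the number of discordant isolates, so the breakdown in the abstract is internally consistent.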


          Most cited references (26)


          The global burden of nontyphoidal Salmonella gastroenteritis.

          To estimate the global burden of nontyphoidal Salmonella gastroenteritis, we synthesized existing data from laboratory-based surveillance and special studies, with a hierarchical preference to (1) prospective population-based studies, (2) "multiplier studies," (3) disease notifications, (4) returning traveler data, and (5) extrapolation. We applied incidence estimates to population projections for the 21 Global Burden of Disease regions to calculate regional numbers of cases, which were summed to provide a global number of cases. Uncertainty calculations were performed using Monte Carlo simulation. We estimated that 93.8 million cases (5th to 95th percentile, 61.8-131.6 million) of gastroenteritis due to Salmonella species occur globally each year, with 155,000 deaths (5th to 95th percentile, 39,000-303,000 deaths). Of these, we estimated 80.3 million cases were foodborne. Salmonella infection represents a considerable burden in both developing and developed countries. Efforts to reduce transmission of salmonellae by food and other routes must be implemented on a global scale.
            Is Open Access

            Multilocus Sequence Typing as a Replacement for Serotyping in Salmonella enterica

            Introduction

            For over 70 years, epidemiological investigations of Salmonella that infect humans and animals have depended on serotyping, the binning of isolates into serovars [1], [2]. Salmonella serotyping depends on specific agglutination reactions with adsorbed antisera that are specific for epitopes (‘factors’) within either lipopolysaccharide (O antigen; encoded by rfb genes) or one of the two, alternate flagellar antigens (phases 1 and 2 of H antigen, encoded by fliC and fljB). Various combinations of 46 O antigens and 85 H antigens have resulted in ∼1,500 serovars within S. enterica subspecies enterica and ∼1,000 in the other subspecies of S. enterica plus S. bongori (Fig. 1) [2].

            Figure 1 (10.1371/journal.ppat.1002776.g001). General overview of the current classification of Salmonella enterica.

            The use of serotyping within Salmonella as a typing method is so widely accepted that governmental agencies have formulated guidelines intended to reduce human salmonellosis by targeting Typhimurium, Enteritidis and three other common serovars in domesticated animals (European Union EC Regulation 2160/2003 of 12/12/2003). Such regulations implicitly assume that serovars are associated with a particular disease potential [3], [4], an assumption that is also suggested by some of their names, e.g. Abortusequi, Abortusovis and Choleraesuis. These designations reflect a medical microbiological tradition of assigning distinctive taxonomic designations to microorganisms that are associated with particular diseases or hosts. However, this tradition is not necessarily warranted from an evolutionary perspective, as illustrated by the following examples. For some taxa, species designations have been used to designate genetically monomorphic clones of a broader species with a different pathogenic potential, e.g. the clone of Yersinia pseudotuberculosis that is called Y. pestis [5], the host-specific ecotypes of the Mycobacterium tuberculosis complex that are designated M. bovis, M. microti, M. pinnipedii and M. caprae [6], or the isolates of Escherichia coli that have been assigned to multiple species of the genus Shigella [7]. In other cases, taxonomic designations have grouped members of paraphyletic groups of microorganisms because they cause similar diseases, such as the anthrax toxin-producing variants of Bacillus cereus that are designated Bacillus anthracis [8]. That all isolates of an individual serovar of S. enterica share a common phylogenetic ancestry should therefore be considered to represent a working hypothesis that requires confirmation. Similarly, a supposed host and/or disease specificity needs to be confirmed by genetically informative methods with isolates from diverse geographical regions. These working hypotheses have been confirmed for serovar Typhi, which corresponds to a genetically monomorphic, recently evolved clone that causes typhoid fever in humans [9]–[11]. In contrast, multiple, discrete lineages have been identified within serovar Newport [12]. Close genetic relatedness and a monolithically uniform association with host/disease specificity remain to be demonstrated for most other serovars, especially because only a few of them have yet been investigated in detail. Serovar designations are widely used for epidemiological purposes due to the belief that they are discriminatory, and because serovars represent a globally understandable form of communication. However, as noted by McQuiston et al. [13], [14], serotyping has multiple disadvantages, including low throughput, high expense, and a requirement for considerable expertise as well as numerous antibodies made by immunizing rabbits. 
As a result, various molecular methods have been proposed as potential alternatives to serotyping for subdividing Salmonella (and other microbes) [15], [16], ranging from PFGE (Pulsed-Field Gel Electrophoresis) [17], [18] through to MLVA (MultiLocus Variable number of tandem repeats Analysis) [19], [20]. These methods are possibly useful for recognizing a common source of microorganisms from a single outbreak [21], but they are inappropriate for reliable assignments of isolates to one of the 2,500 S. enterica serovars. Still other attempts have been made to develop DNA-sequence based equivalents of serotyping [22]–[26], including the detection of particular single nucleotide polymorphisms (SNPs) within flagellar antigens [13], [14]. This approach shares with serotyping the assumption that serotyping reflects genetic relatedness or disease specificity, which need not be generally true [12]. For example, genes encoding antigenic epitopes can be imported by horizontal genetic exchange and homologous recombination from unrelated lineages. As a result, genetically related serovars such as Heidelberg and Typhimurium possess very different fliC alleles whereas genetically distinct serovars can possess nearly identical alleles [27]. Thus, replacing serological determination by serotype-based molecular assays would maintain a system that does not necessarily reflect genetic relatedness. Furthermore, some serovar designations will need revision because they distinguish between minor antigenic variants of organisms that are genetically very similar, e.g. Dublin and Rostock [28] or Paratyphi A and Sendai [29]. We recommend another approach, namely using neutral markers to identify genetically related clusters of S. enterica. Serovar designations that reflect such groupings could be preserved, and possibly be detected by informative SNPs in those neutral markers, whereas other serovars need to be revised or possibly eliminated. 
Twenty years ago, a valiant attempt was made to identify natural groupings within S. enterica on the basis of MultiLocus Enzyme Electrophoresis (MLEE) [29]–[31]. MLEE data identified multiple monophyletic lineages that corresponded to individual serovars. Problematically, most serovars that were examined included exceptional isolates that were unrelated to the main lineage, and some serovars were composed of multiple, genetically unrelated lineages rather than one predominant lineage. MLEE was never generally accepted by microbiologists and these observations have not influenced the general use of serovar designations. Instead of MLEE, a sequence-based alternative, MultiLocus Sequence Typing (MLST), has gained broad acceptance for many microbial species [32]. MLST is based on similar principles to MLEE, but has greater discrimination and is more objective because it is based on sequences of multiple housekeeping gene fragments rather than electrophoretic migration of proteins. Of equal importance, MLST schemes are community efforts because the data are publicly available online (http://pubmlst.org/databases.shtml) and data can be entered from decentralized sources. Isolates that possess identical alleles for all gene fragments are assigned to a common Sequence Type (ST), and STs that share all but one or two alleles are grouped into ST-based clonal complexes [33] on the basis of eBurst [34]. An MLST scheme involving seven housekeeping gene fragments was developed for the analysis of serovar Typhi [9], and subsequently tested with 110 isolates from 25 serovars of S. enterica subspecies enterica [35], most of which were from Selander's SARB collection of reference strains for MLEE [30]. Subsequent analyses have used this scheme to survey serovars Newport [12], [36] and Typhimurium [37]–[39], as well as smaller numbers of isolates of various serovars from wild animals in Australia [40] and the mesenteric lymph nodes of cattle in Canada [41]. 
The same scheme has also been used to survey the genetic properties of antibiotic-resistant isolates among a global sample of various serovars [42]. These initial results suggested that MLST often correlates with serovar, with some exceptions. If this inference were correct, it would be advisable to replace serotyping by MLST for routine epidemiological purposes. We therefore embarked on a major, decentralized effort to test this hypothesis. We investigated isolates from diverse hosts, both diseased and healthy, as well as from the environment. We screened isolates from all continents and deliberately included representatives of rare serovars as well as unusual monophasic and diphasic variants from reference collections. All these data were submitted to a publicly accessible MLST database (http://mlst.ucc.ie/mlst/dbs/Senterica). In April 2011, that database included 4,257 isolates (Table S1) from 554 serovars of S. enterica subspecies enterica that had been assigned to 1,092 STs. The database also contained 436 isolates from the other S. enterica subspecies as well as Salmonella bongori, whose properties will be described elsewhere, as will analyses of associations with host or geography. Here we describe the population structure of subspecies enterica on the basis of MLST, examine the extent of congruence between serotyping and MLST clusters, and conclude that serotyping of S. enterica should be replaced by MLST.

Results

Many Salmonella STs cluster together in discrete groups, which we refer to as eBGs (eBurstGroups). We chose the designation eBG rather than “Clonal Complex” or “ST Complex” because Clonal Complex implies clonality [43], whereas homologous recombination between unrelated lineages is frequent in S. enterica [12], [44], [45], and ST Complex does not specify a grouping algorithm. Following the recommendations by Feil et al. 
[46], [47], we designated as an eBG all groups of two or more STs that were connected by pair-wise identity at six of the seven gene fragments, i.e. they shared six of the seven alleles that defined the ST. As the MLST database has grown, multiple singleton STs containing multiple isolates have formed eBG clusters via the incremental identification of novel, related STs. We therefore also designated ungrouped singleton STs as eBGs when they contained 10 or more isolates. Finally, a few existing eBGs were expanded to include singleton STs that shared five identical alleles (double locus variants; DLVs) as well as a common serovar. Based on these criteria, 3,550 of the 4,257 isolates were assigned to a total of 138 eBGs, containing between 580 isolates in multiple STs and two isolates in two STs (Table S2).

eBGs are natural clusters of genetically related isolates

We initially recognized the existence of eBGs by visual examination of a minimal spanning tree (MSTree) of STs connected by the numbers of shared alleles. The MSTree of subspecies enterica shows multiple starburst-like clusters (Fig. 2), which in large part correspond to eBGs as defined here. Similar to eBurst groups in other species, most clusters radiate from a central node which contains numerous isolates, a phenomenon which is usually interpreted as representing monophyletic lineages of STs that have evolved from a single founder node [34]. We deferred interpretations on evolutionary history within eBGs, including the identification of founders, until genomic studies of historically representative isolates have been conducted, and therefore arbitrarily assigned an otherwise uninformative, unique number to each eBG.

Figure 2 (10.1371/journal.ppat.1002776.g002). Minimal spanning tree (MSTree) of MLST data on 4257 isolates of S. enterica subspecies enterica. Each circle corresponds to one of 1,095 STs, whose size is proportional to the number of isolates. 
The topological arrangement within the MSTree is dictated by its graphic algorithm, which uses an iterative network approach to identify sequential links of increasing distance (fewer shared alleles), beginning with central STs that contain the largest numbers of isolates. As a result, singleton STs are scattered throughout the MSTree proximal to the first node that was encountered with shared alleles, even if equal levels of identity exist to other nodes that are distant within the MSTree. The figure only shows links of six identical gene fragments (SLVs; thick black line) and five identical gene fragments (DLVs; thin black line) because these correlate with eBGs, which are indicated by grey shading. The serovar associated with most of the isolates in each eBG or singleton ST is indicated by color coding for the 28 most frequent serovars (see legend at lower right). Within each ST, isolates of a different serovar or for which information is lacking are shown in white, except for monophasic variants.

Historically, MLEE data of S. enterica were interpreted on the basis of phylogenetic trees [29]–[31]. Trees attempt to depict genealogies (vertical descent from a common ancestor), and can be confounded by homologous recombination between unrelated lineages, a common occurrence in S. enterica [44], [45]. Indeed, only one higher level population structure with strong statistical support has been identified within subspecies enterica; this structure has been referred to as Clade B [40], [44], [48] or Lineage 3 [45]. We confirmed the existence of Lineage 3 in our large dataset by a BAPS [49] cluster analysis of the allelic differences between STs using an upper bound of 2–7 clusters (Fig. S2). Similar results were obtained with concatenated sequences for all seven gene fragments regardless of upper bound, or when using Structure [50]. 
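
As an illustration of the eBG definition given in the Results (two or more STs linked by identity at six of the seven gene fragments, plus ungrouped singleton STs with 10 or more isolates), the grouping can be sketched as single-linkage clustering over allelic profiles. This is a minimal Python sketch under stated assumptions, not the eBurst implementation used in the study: the function names, ST labels, allele numbers and isolate counts are hypothetical, and the manual extension to DLVs sharing a common serovar is omitted.

```python
from itertools import combinations


def shared_alleles(p, q):
    """Number of loci at which two allelic profiles carry the same allele."""
    return sum(a == b for a, b in zip(p, q))


def group_ebgs(profiles, isolate_counts, min_shared=6, min_singleton_isolates=10):
    """Group STs into eBG-like clusters by single linkage.

    profiles: dict mapping ST name -> tuple of seven allele numbers.
    isolate_counts: dict mapping ST name -> number of isolates.
    """
    parent = {st: st for st in profiles}

    def find(st):
        # Union-find lookup with path compression.
        while parent[st] != st:
            parent[st] = parent[parent[st]]
            st = parent[st]
        return st

    # Link every pair of STs that share at least six of seven alleles (SLVs).
    for a, b in combinations(profiles, 2):
        if shared_alleles(profiles[a], profiles[b]) >= min_shared:
            parent[find(a)] = find(b)

    groups = {}
    for st in profiles:
        groups.setdefault(find(st), set()).add(st)

    # Keep multi-ST groups, and singleton STs with enough isolates.
    ebgs = []
    for members in groups.values():
        n_isolates = sum(isolate_counts.get(st, 0) for st in members)
        if len(members) >= 2 or n_isolates >= min_singleton_isolates:
            ebgs.append(members)
    return ebgs


# Hypothetical toy data: allele numbers and counts are invented for illustration.
profiles = {
    "ST_A": (1, 1, 1, 1, 1, 1, 1),
    "ST_B": (1, 1, 1, 1, 1, 1, 2),  # SLV of ST_A
    "ST_C": (1, 1, 1, 1, 1, 3, 2),  # SLV of ST_B, DLV of ST_A
    "ST_D": (9, 9, 9, 9, 9, 9, 9),  # unlinked singleton, many isolates
    "ST_E": (8, 8, 8, 8, 8, 8, 8),  # unlinked singleton, few isolates
}
counts = {"ST_A": 50, "ST_B": 4, "ST_C": 1, "ST_D": 12, "ST_E": 3}

ebgs = group_ebgs(profiles, counts)
print(sorted(sorted(g) for g in ebgs))
```

In this toy example ST_C joins the group through ST_B even though it is only a DLV of ST_A, which is the chaining behaviour expected of single linkage; ST_E is left ungrouped because it has fewer than 10 isolates.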
In order to assess the robustness of our eBG classification, we investigated the fine structure of subspecies enterica by three additional, independent clustering methods. Firstly, we analyzed concatenated sequences with ClonalFrame [51], which determines tree topologies after stripping signals of lateral gene transfer and homologous recombination. ClonalFrame identified 163 lineages containing more than one ST (Table 1), each of which coalesced far from the root (Fig. S3). This result provides further support for the conclusion [44], [45] that there is little deep phylogenetic signal within the MLST genes. Secondly, we analyzed the sequence data by a gene by gene bootstrap approach as described by Falush et al. [44]. A consensus UPGMA tree based on the concatenated sequences was then stripped of branches which did not find 50% support in 1000 gene by gene bootstrap trees. The bootstrap approach identified 167 clusters of STs. Finally, we used BAPS on allelic identities with an upper bound of 400, which resulted in 216 clusters. For each of the three methods, many clusters each contained only one of the 138 eBGs and most or nearly all of the 138 eBGs contained isolates that were all assigned to a single cluster by each of the three alternative approaches (Table 1). The three methods were also largely congruent: for 108 eBGs, all the isolates were assigned to a single cluster by all three methods and for 24 others, the isolates were clustered together by two methods (Fig. 3). Finally, data permutation revealed that all of these correspondences between eBGs and the other methods were significantly non-random (p < […] >70 years, and which is so embedded in microbiological thinking, the use of serotyping alone is often uninformative. Most of the S. enterica isolates in any European country belong to a very limited number of serovars, usually fewer than 10 (Fig. S8). 
In fact, in recent years, most isolates belonged to Enteritidis, Typhimurium or Infantis, which results in relatively low discrimination. Furthermore, many current isolates of Typhimurium are monophasic and cannot be unambiguously recognized by serotyping [85]. Epidemiological investigations of outbreaks often depend on phage typing [86], PFGE [17], [18] or MLVA [19], alone or in combination, usually after initial triage based on serotyping. These methods could continue to be used, and are likely to be even more effective if combined with an initial assignment to genetic groupings such as eBGs.

MLST for S. enterica

MLST was first described in 1998 [87] and has now become the gold standard for long term epidemiology and population genetic analyses of pathogenic microbes. Of the 79 MLST databases that are publicly available (http://pubmlst.org/databases.shtml), the S. enterica MLST database (http://mlst.ucc.ie) ranks fourth in number of isolates. This publicly accessible and actively curated web-based MLST database facilitates the global exchange of information. In particular, new alleles and new STs depend on user submissions rather than decisions by a central reference laboratory, and are immediately made publicly accessible. Similar global exchange of information at the strain level does not exist for serotyping. The database currently provides data for >500 of the 1,500 existing serovars in subspecies enterica, including all common serovars and many that are rare. These data have been accumulated through a decentralized global effort since 2002 and with time, we anticipate that representatives of all 1,500 serovars will be tested, thus providing a reasonably complete mapping between serovar and eBG/ST. The data presented here demonstrate that MLST is a valuable tool for the identification of genetic clusters and elucidating the diversity of known serovars. 
We also believe that it has the potential to completely replace serotyping, over which it possesses multiple advantages. Replacement of serotyping by MLST would involve changes in nomenclature. In cases where eBGs are relatively uniform in serovar and correspond to monophyletic groups, the serovar designations could be maintained together with the eBG designation for an interim period in order to provide continuity, e.g. eBG1 (Typhimurium). For polyphyletic serovars, the serovar designation has little information content and should be eliminated as soon as possible, as is the case for other species for which MLST has become the common language. Even now, a surprisingly large number of entries are already being deposited at the MLST website without accompanying serovar information. In private discussions, some individuals have claimed that MLST is too technically demanding, expensive and slow. However, performing MLST does not require much more than a PCR machine plus training on working with DNA sequences. Our experience is that MLST does not require much technical competence, and laboratory scientists who are capable of handling serotyping can readily learn to handle MLST. MLST is cheaper than serotyping, sequencing of PCR products can be performed commercially and it can be automated. In our hands, with the help of robotic fluidics, one individual can easily complete the necessary manipulations from initial single colony isolation through to finished sequencing at the rate of 200 isolates per week and a cost per isolate of under €25. A few days are needed to enter the sequence traces into a database and evaluate them with the help of dedicated scripts. In general, a small fraction of traces need to be repeated, which then doubles the time needed to provide definitive results for all 200 isolates. 
We anticipate that in the future, technical developments will allow even higher throughput of MLST assignments through multiplexed SNP-based typing and/or next-generation sequencing. Other individuals have claimed that MLST will soon be replaced by whole genome sequencing (WGS), whose price is rapidly approaching that of MLST. Instead we argue that WGS and MLST are complementary, and should be pursued in parallel. WGS provides essential information for epidemiological tracking and will yield invaluable insights into the detailed population structure of bacterial pathogens [69], [88], including S. enterica. However, the evaluation of SNPs and genomic sequences from WGS takes much more time than the evaluation of paired traces from seven gene fragments. WGS currently suffers from differences between samples in quality and number of reads per nucleotide, which presents difficulties in extracting identical gene fragments from multiple genomes due to variable missing data. The S. enterica MLST database will probably contain data for >10,000 isolates in the near future, as do three other MLST databases today, whereas it would currently be difficult to extract information with comparable certainty from that many genomes. We propose that MLST should be used to provide a rapid overview of the population structure of S. enterica, which can then be used to identify selected isolates for investigation in greater detail by genome sequencing. Such efforts including the integration of genomic sequences and MLST data are already underway [89]. A third criticism of MLST for S. enterica is that it does not provide the fine resolution needed for outbreak analysis and short-term epidemiology. Indeed, MLST data does not generally have the same fine resolution as phage typing, PFGE, and MLVA. Multiple phage types were present within ST19, the central ST in eBG1 (Typhimurium), and within ST11, the central ST of eBG4 (Enteritidis, Gallinarum, Pullorum). 
However, MLST does provide somewhat greater resolution than serotyping because eBGs tend to contain multiple STs once a sufficient number of isolates has been tested. On occasion, MLST has also given hints of phylogeographic and host specificity. For example, invasive disease caused by Typhimurium in Africa is associated with ST313 and its descendant SLVs within eBG1 [39]. ST213 within eBG1 has only been isolated in Mexico [38]. Similarly, STs 66 and 634 of eBG6 (Choleraesuis) were first isolated in Canada (1978) and the USA (1981–1986) and subsequently from humans and swine in Taiwan (1998–2004). A potential link between these isolates may have been breeding pigs, which have been imported into Taiwan from Canada and the USA since 1980 (http://www.angrin.tlri.gov.tw/indexd/AGLP.htm). We conclude that MLST is a powerful candidate for the reference classification system for Salmonella, and can replace serotyping for that purpose. Similar to serotyping, additional methods will be needed to provide the fine resolution that is required for short term epidemiology. In other species where serotyping was previously the common language for strain tracking and epidemiology, such as E. coli or Klebsiella pneumoniae, it was rapidly replaced by MLST nomenclature after its introduction. We are confident that MLST designations will also be adopted widely in the near future for S. enterica. By eliminating multiple misleading interpretations about strain relatedness associated with serotyping, this step would represent a major improvement for the epidemiology and control of Salmonella infections.

Materials and Methods

Bacterial strain collection and microbiological properties

The analyses presented here are based on 4257 isolates whose data have been submitted to http://mlst.ucc.ie/mlst/dbs/Senterica by ourselves and others. 
Of these, 1770 are maintained in the strain collection of MA at University College Cork, and 1042 in the strain collection of FXW at the Institut Pasteur, for a total of 2643 in either or both of those collections. Biotyping and serotyping were performed in multiple laboratories but most of the tests were performed under the supervision of FXW or MC. Serotyping and biotyping were according to the modified Kauffmann-White scheme [2], except as described below. Basic information on all isolates can be downloaded from the website. In addition, a detailed description of strain properties for Paratyphi B and Java isolates is presented in Table S6. The distinction between Paratyphi B and Java was based on two tests, which gave concordant results after up to 7 days incubation: the lead acetate protocol 1 for d-tartrate fermentation described by Malorny et al. [58] and the ability to grow on d-tartrate as the sole carbon source as described by Weill et al. [64]. The start codon of STM3356 was sequenced as described by Malorny et al. [58]. Table S7 gives detailed information on results with 6,7:c:1,5 isolates. These were assigned to serovars on the basis of the biochemical properties which are summarized in Table 3, and which are similar to the tests and recommendations by Le Minor et al. [65]. Mucate utilization, dulcitol fermentation and H2S production were evaluated after 24 hrs incubation in standard media and tartrate fermentation was evaluated after 7 days, as described above. A separate manuscript is in preparation on differences between the contents of Selander's SARA and SARB collections. The conclusions drawn here were largely based on isolates stored by Kenneth E. Sanderson and corroborated by the collection of Fidelma Boyd. Serovar assignments were according to information uploaded to the website except that many atypical isolates and the Paratyphi B, Java and 6,7:c:1,5 isolates were retyped. 
DNA sequencing

MLST was performed on seven gene fragments as described [9], [12] using the amplification and sequencing primers that are described on the MLST website. Sequences for each gene fragment were assembled from at least two independent PCR products, and trimmed to a constant length of 399–501 bp as indicated on the website. All allelic sequences and allelic combinations can be freely downloaded from the website. fliC and fljB were sequenced using the same oligonucleotide primers for PCR amplification and sequencing as previously described [90], [91]. These primers each yield a ∼1500 bp product, which was trimmed to correspond to positions 73–1344 within the fliC gene and 109–1428 within the fljB gene, as shown in Figs. 6 and S5. Sequences have been deposited in GenBank under the accession codes HQ871156–HQ871237 (Table S8).

Microarray analysis of SPI-7 (Salmonella Pathogenicity Island-7)

A custom oligonucleotide probe-based array was designed as previously described [92] in order to detect genes related to the absence and presence of SPI-7. After labelling, probes were purified and applied to microarray slides [93]. Genomic DNA was sonicated to yield 200–500 bp fragments, purified and labelled with Cy3-dCTP using the BioPrime DNA Labelling System (Invitrogen–BioSciences Ltd., Dun Laoghaire, Ireland). Duplicate slides were hybridized with the dCTP labelled DNAs in 48% formamide at 55 °C for 16–20 hrs in a humid chamber. The slides were washed at RT, washed again at 50 °C, scanned (GenePix 4000B laser scanner, Axon Instruments, Redwood City, Calif.) and processed (GenePix Pro 3.0). The full dataset was analyzed using R (www.r-project.org), and Bioconductor (www.bioconductor.org) as described [94]. In brief, the bimodal distribution that was observed was treated as two overlapping Normal distributions. Means and 95% confidence intervals were determined for each distribution. 
Probes were scored “absent” if the log2 intensity was within or below the 95% CI for the “low” peak; “present” if the log2 intensity was within or above the 95% CI for the “high” peak and intermediate values were scored as “uncertain”. As a control, PCR tests similar to those described previously [95] were used to screen for presence or absence of larger regions of SPI-7.

Phylogenetic analyses

Concatenated sequences from all seven gene fragments within 1092 STs were aligned using Mega 4 [96] and analyzed by ClonalFrame [51], yielding the tree in Fig. S3 and a total of 903 clustered STs in 163 groups. Gene by gene bootstraps [44] were also performed on 1092 STs, except that for each of 1000 iterations, the seven gene fragments used for concatenation were chosen at random from the seven genes, with replacement. UPGMA trees were generated from all 1000 iterations using Paup [97] and a homemade script in Perl (available on request) was used to generate a 50% consensus tree based on the percentage support for each branch. 569 branches to individual STs that did not meet these criteria were excluded by this script. dN and dS were calculated on each gene fragment using Mega. UPGMA trees of the fliC and fljB nucleotide sequences and the FliC and FljB amino acid sequences were generated in Bionumerics 6.5 (Applied Maths, Sint-Martens-Latem, Belgium), as shown in Figs. 7–8 and S4–S7. Maximum likelihood topologies of synonymous and non-synonymous sites were calculated using PhyML [98].

Clustering analyses

A minimal spanning tree was generated from the allelic profiles of isolates using the predefined template in BioNumerics 6.5 designated as MST for categorical data, which preferentially joins single and double locus variants with the largest number of isolates per ST. For allelic comparisons, Baps 5.3 [49] was applied to the allelic profiles from each ST with an upper bound for group numbers ranging between 300 and 500. 
The number of clusters ranged from 215 to 221 as the upper bound increased. The data presented here are based on an upper bound of 400, which yielded 216 clusters. Baps was also used with allelic differences with an upper bound of 2–7 or with concatenated sequences (Fig. S2) as described in Text S1.

Supporting Information

Figure S1 MSTree from Fig. 2 color-coded according to BAPS assignments to five clusters of allelic differences among 1097 STs. STs assigned to lineage 3 are colored in red and the four other colors indicate four other clusters of STs. Similar results were obtained with BAPS or STRUCTURE assignments to 5 clusters based on concatenated sequences of the seven MLST genes. The existence of STs from the other four clusters near the bottom of the figure is due to rare intermediate STs with recombinant alleles that artificially join lineage 3 to other clusters in a minimal spanning tree. (PDF)

Figure S2 H, the index of genetic diversity, versus number of isolates per serovar in the MLST database. H was calculated as (n/(n-1))*(1.0 - the sum of squares of the relative frequency per serovar of isolates in discrete eBGs or singleton STs) where n is the total number of isolates for that serovar. H values above 0.0 indicate multiple eBGs/STs per serovar. Each dot corresponds to one or more serovars from Table S1 from which at least two isolates had been MLST typed. The sizes of the dots indicate the number of serovars for each data point with overlapping numbers of isolates and H values (see legend). Note that the abscissa is logarithmic rather than linear. (PDF)

Figure S3 Radial dendrogram of 163 clusters of STs and 189 singleton STs found by ClonalFrame among concatenated sequences of seven housekeeping genes from 1,092 STs of S. enterica subspecies enterica. Each line represents a distinct ST, and groups of related STs are seen at the periphery of the dendrogram.
(PDF)

Figure S4 UPGMA tree of diversity within a 448 amino acid fragment of the FliC protein. (PDF)

Figure S5 Variant nucleotides in a 1,320 bp fragment of the fljB gene. Position refers to the nucleotide position within the trimmed fragment, which starts 108 bp from the beginning of the intact gene in strain LT-2. (PDF)

Figure S6 UPGMA tree of nucleotide diversity within a 1,320 bp fragment of the fljB gene. (PDF)

Figure S7 UPGMA tree of diversity within a 440 amino acid fragment of the FljB protein. (PDF)

Figure S8 Diversity versus frequency of S. enterica subspecies enterica isolates in France, the EU and the USA. Frequencies of serovars in pooled data over several years are plotted semi-logarithmically against H for each serovar as in Fig. S2. For parts B-D, all serovars are included and the number of discrete serovars at each position is indicated by different sized circles (see legend). Part A is based on the 29 most common serovars, none of which overlapped within the scattergram. Data were obtained from http://www.ecdc.europa.eu/en/activities/surveillance/TESSy/Pages/TESSy.aspx (A), internal records at the French National Reference Center for Salmonella, Institut Pasteur (B), as well as http://www.cdc.gov/ncidod/dbmd/phlisdata/salmonella.htm (C, D). (PDF)

Table S1 eBurstGroups and singleton STs per serovar among 4,257 isolates of S. enterica subspecies enterica. (XLSX)

Table S2 Serovars in 137 eBurstGroups containing 3,550 isolates of S. enterica subspecies enterica. (XLSX)

Table S3 Antigenic formulas, eBGs and STs of serovars associated with Typhimurium. (DOC)
Table S4 Antigenic formulas, eBGs and STs of serovars associated with Enteritidis and Dublin. (DOC)

Table S5 Antigenic formulas, eBGs, STs and dTar status of serovars associated with Paratyphi B. (DOC)

Table S6 Comparison of groupings from MLST versus MLEE and virulence tests for serovars Paratyphi B and var Java. (XLSX)

Table S7 Properties of supposed 6,7:c:1,5 isolates. (XLSX)

Table S8 GenBank accession codes and sequence groupings of fliC and fljB alleles. (XLS)

Text S1 Deep phylogenetic structure and historical information regarding 6,7:c:1,5 isolates. (DOCX)
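
The index of genetic diversity H defined in the legend to Figure S2 can be illustrated with a short calculation. The following is a minimal sketch in Python and is not the authors' code: the function name `diversity_index` and the group labels in the example are hypothetical, and it assumes that every isolate of a serovar has already been assigned to exactly one eBG or singleton ST.

```python
from collections import Counter
from typing import Hashable, Sequence


def diversity_index(assignments: Sequence[Hashable]) -> float:
    """Return H = (n/(n-1)) * (1 - sum of squared relative frequencies).

    `assignments` holds one group label (eBG or singleton ST) per isolate
    of a single serovar; n is the number of isolates for that serovar.
    """
    n = len(assignments)
    if n < 2:
        # The legend only considers serovars with at least two typed isolates.
        raise ValueError("H is undefined for fewer than two isolates")
    counts = Counter(assignments)
    sum_of_squares = sum((count / n) ** 2 for count in counts.values())
    return (n / (n - 1)) * (1.0 - sum_of_squares)


# Hypothetical serovar whose ten isolates all fall in one group: H is 0.0.
single_group = ["group_A"] * 10

# Hypothetical serovar split over three groups (6, 3 and 1 isolates).
mixed_groups = ["group_A"] * 6 + ["group_B"] * 3 + ["group_C"]

print(diversity_index(single_group))
print(round(diversity_index(mixed_groups), 3))
```

With these made-up inputs, the first serovar gives H = 0.0 (a single eBG/ST), while the second gives a value above 0.0, which is how the legend reads H: any positive value signals multiple eBGs or STs within one serovar.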

              Salmonella serotype determination utilizing high-throughput genome sequencing data.

              Serotyping forms the basis of national and international surveillance networks for Salmonella, one of the most prevalent foodborne pathogens worldwide (1-3). Public health microbiology is currently being transformed by whole-genome sequencing (WGS), which opens the door to serotype determination using WGS data. SeqSero (www.denglab.info/SeqSero) is a novel Web-based tool for determining Salmonella serotypes using high-throughput genome sequencing data. SeqSero is based on curated databases of Salmonella serotype determinants (rfb gene cluster, fliC and fljB alleles) and is predicted to determine serotype rapidly and accurately for nearly the full spectrum of Salmonella serotypes (more than 2,300 serotypes), from both raw sequencing reads and genome assemblies. The performance of SeqSero was evaluated by testing (i) raw reads from genomes of 308 Salmonella isolates of known serotype; (ii) raw reads from genomes of 3,306 Salmonella isolates sequenced and made publicly available by GenomeTrakr, a U.S. national monitoring network operated by the Food and Drug Administration; and (iii) 354 other publicly available draft or complete Salmonella genomes. We also demonstrated Salmonella serotype determination from raw sequencing reads of fecal metagenomes from mice orally infected with this pathogen. SeqSero can help to maintain the well-established utility of Salmonella serotyping when integrated into a platform of WGS-based pathogen subtyping and characterization.

                Author and article information

                Contributors
                Journal
                PeerJ
                PeerJ Inc. (San Francisco, USA)
                2167-8359
                5 April 2016
                2016; 4: e1752
                Affiliations
                [1] Gastrointestinal Bacterial Reference Unit, Public Health England, London, United Kingdom
                [2] Applied Laboratory and Bio-Informatics Unit, Public Health England, London, United Kingdom
                [3] Gastrointestinal Infections, NIHR Health Protection Research Unit in Gastrointestinal Infections, London, United Kingdom
                Article
                1752
                10.7717/peerj.1752
                4824889
                27069781
                1015ce9d-bc04-4721-b6ab-bbe4f1279dd2
                ©2016 Ashton et al.

                This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.

                History
                : 9 October 2015
                : 12 February 2016
                Funding
                Funded by: National Institute for Health Research Health Protection Research Unit (NIHR HPRU)
                Funded by: Public Health England (PHE)
                Funded by: University of East Anglia
                Funded by: University of Oxford
                Funded by: Institute of Food Research
                The research was partially funded by the National Institute for Health Research Health Protection Research Unit (NIHR HPRU) in Gastrointestinal Infections at the University of Liverpool in partnership with Public Health England (PHE), University of East Anglia, University of Oxford and the Institute of Food Research. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
                Categories
                Bioinformatics
                Genomics
                Microbiology
                Public Health

                whole genome sequencing, Salmonella, bioinformatics, multi-locus sequence typing, public health
