53
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      The Opportunistic Pathogen Propionibacterium acnes: Insights into Typing, Human Disease, Clonal Diversification and CAMP Factor Evolution

      research-article

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          We previously described a Multilocus Sequence Typing (MLST) scheme based on eight genes that facilitates population genetic and evolutionary analysis of P. acnes. While MLST is a portable method for unambiguous typing of bacteria, it is expensive and labour intensive. Against this background, we now describe a refined version of this scheme based on two housekeeping ( aroE; guaA) and two putative virulence ( tly; camp2) genes (MLST 4) that correctly predicted the phylogroup (IA 1, IA 2, IB, IC, II, III), clonal complex (CC) and sequence type (ST) (novel or described) status for 91% isolates (n = 372) via cross-referencing of the four gene allelic profiles to the full eight gene versions available in the MLST database ( http://pubmlst.org/pacnes/). Even in the small number of cases where specific STs were not completely resolved, the MLST 4 method still correctly determined phylogroup and CC membership. Examination of nucleotide changes within all the MLST loci provides evidence that point mutations generate new alleles approximately 1.5 times as frequently as recombination; although the latter still plays an important role in the bacterium's evolution. The secreted/cell-associated ‘virulence’ factors tly and camp2 show no clear evidence of episodic or pervasive positive selection and have diversified at a rate similar to housekeeping loci. The co-evolution of these genes with the core genome might also indicate a role in commensal/normal existence constraining their diversity and preventing their loss from the P. acnes population. The possibility that members of the expanded CAMP factor protein family, including camp2, may have been lost from other propionibacteria, but not P. acnes, would further argue for a possible role in niche/host adaption leading to their retention within the genome. These evolutionary insights may prove important for discussions surrounding camp2 as an immunotherapy target for acne, and the effect such treatments may have on commensal lineages.

          Related collections

          Most cited references67

          • Record: found
          • Abstract: found
          • Article: not found

          eBURST: inferring patterns of evolutionary descent among clusters of related bacterial genotypes from multilocus sequence typing data.

          The introduction of multilocus sequence typing (MLST) for the precise characterization of isolates of bacterial pathogens has had a marked impact on both routine epidemiological surveillance and microbial population biology. In both fields, a key prerequisite for exploiting this resource is the ability to discern the relatedness and patterns of evolutionary descent among isolates with similar genotypes. Traditional clustering techniques, such as dendrograms, provide a very poor representation of recent evolutionary events, as they attempt to reconstruct relationships in the absence of a realistic model of the way in which bacterial clones emerge and diversify to form clonal complexes. An increasingly popular approach, called BURST, has been used as an alternative, but present implementations are unable to cope with very large data sets and offer crude graphical outputs. Here we present a new implementation of this algorithm, eBURST, which divides an MLST data set of any size into groups of related isolates and clonal complexes, predicts the founding (ancestral) genotype of each clonal complex, and computes the bootstrap support for the assignment. The most parsimonious patterns of descent of all isolates in each clonal complex from the predicted founder(s) are then displayed. The advantages of eBURST for exploring patterns of evolutionary descent are demonstrated with a number of examples, including the simple Spain(23F)-1 clonal complex of Streptococcus pneumoniae, "population snapshots" of the entire S. pneumoniae and Staphylococcus aureus MLST databases, and the more complicated clonal complexes observed for Campylobacter jejuni and Neisseria meningitidis.
            Bookmark
            • Record: found
            • Abstract: found
            • Article: not found

            Multilocus sequence typing of bacteria.

            Multilocus sequence typing (MLST) was proposed in 1998 as a portable, universal, and definitive method for characterizing bacteria, using the human pathogen Neisseria meningitidis as an example. In addition to providing a standardized approach to data collection, by examining the nucleotide sequences of multiple loci encoding housekeeping genes, or fragments of them, MLST data are made freely available over the Internet to ensure that a uniform nomenclature is readily available to all those interested in categorizing bacteria. At the time of writing, over thirty MLST schemes have been published and made available on the Internet, mostly for pathogenic bacteria, although there are schemes for pathogenic fungi and some nonpathogenic bacteria. MLST data have been employed in epidemiological investigations of various scales and in studies of the population biology, pathogenicity, and evolution of bacteria. The increasing speed and reduced cost of nucleotide sequence determination, together with improved web-based databases and analysis tools, present the prospect of increasingly wide application of MLST.
              Bookmark
              • Record: found
              • Abstract: found
              • Article: found
              Is Open Access

              Horizontal Transfer, Not Duplication, Drives the Expansion of Protein Families in Prokaryotes

              Introduction Prokaryotes have highly variable gene repertoires, varying from just over 100 genes to nearly 10000 [1], [2]. Such variations in genome size are typically associated with expansions and contractions of protein families. Expansions of protein families are associated with the acquisition of novel functions, novel regulatory structures and system robustness [3]–[5]. They can take place by horizontal gene transfer, in which case homologs are called xenologs, or by intra-chromosomal duplication, in which case homologs are called paralogs. Genes are identical upon duplication and if there is no direct selection on multiple copies, the redundant extra copies are quickly lost. Yet, sometimes gene duplications persist in genomes due to selection for increased gene dosage. Periods of such adaptive gene duplication are windows of opportunity for the acquisition of new functions by the slow evolutionary divergence between the duplicates [6]–[8]. Duplicated genes can thus be fixed because sub- or neo-functionalization processes render the two copies adaptive when selection for higher gene dosage ceases (reviewed in [9]–[12]). There is ample evidence that intra-chromosomal gene duplication (IGD) has an adaptive role in bacteria [13], [14], e.g. in antigenic variation, antibiotic resistance or in genome expansion [8], [15]–[19]. Furthermore, models aiming at explaining the patterns of biological networks are in general based on the preconception that protein families expand by gene duplication processes [20], [21]. Horizontal gene transfer (HGT) results in the acquisition of radically new genetic information, liberating genomic evolutionary processes from tinkering exclusively with pre-existing genes [22]–[24]. HGT occurs at very high rates leading to very large species pan-genomes [25]. Yet, while up to 96% of the genes in a given prokaryote genome might have been affected by HGT [26]–[28], it has been estimated that HGT contributes at best to 25% of all expansions of protein families [26], [29]–[32] (22% overall but 60% for large protein families in [23]). These contradictory conclusions require an explanation. Prior analyses on the relative abundance of paralogs over xenologs were performed several years ago using the available distant genomes. However, comparisons of genomes from very divergent lineages pose several problems: (i) disambiguation of orthologs, paralogs and xenologs is very error-prone, (ii) lineages are saturated with changes, (iii) expansions within terminal branches are associated with paralogy when they can result from HGT between closely related taxa. In fact, long terminal branches hide a complex history of duplications and transfer and consequently the relative roles of transfer and duplication can only be assessed using closely related genomes. Here, we apply a protein family analysis pipeline to a multitude of closely related complete genomes to re-examine the role of transfer and duplication in expansion of protein families in prokaryotes. Our comparisons involve genomes that have diverged recently (Figures S1 and S2). As a result, we can use sequence similarity and co-localization information to infer accurate core genomes, phylogenies and ancestral events. At these evolutionary distances, paralogs arising from duplications having occurred since the last common ancestor are very similar in sequence and thus can easily be separated from most xenologs. Finally, at these evolutionary scales the population data enables the analysis of fixation of protein family expansions and test if some processes have a longer-term effect on genomes than others. Results Identification and characterization of protein families We analyzed 110 genomes encompassing eight distant clades (Table S1): Helicobacter, Neisseria, Streptococcus and Sulfolobus (small genomes, ∼2 Mb), Enterobacteriaceae and Bacillus (average, ∼4–5 Mb), Bradyrhizobiaceae and Pseudomonas (large, ∼6 Mb). The clades were selected to include a large number of closely related genomes and to span diverse phyla, G+C contents and genome sizes (Table 1). By using sequence similarity analysis followed by clustering (see Materials and Methods), we obtained a set of 59,541 families containing a total of 419,035 proteins (Table 1). Families with exactly one member per genome and where all members are highly similar (i.e. putative orthologs) were used to define the core genomes and to build highly robust phylogenetic trees of the clades (100% bootstraps in nearly all branches, Figure S3). Many families (48%) consisted of singletons, i.e. they contained one single protein, thereby confirming that new families in prokaryotes are introduced at high rates by HGT. These families were excluded from the analysis. 10.1371/journal.pgen.1001284.t001 Table 1 Expansions of gene families. Clade genomes genes per genome families families w/expansions paralogs & xenologs Enterobacteriaceae 41 4881 15729 1148 6803 Streptococcus 17 1965 4474 317 1255 Bradyrhizobiaceae 9 6363 14231 1009 2203 Helicobacter 8 1525 2446 88 105 Neisseria 11 2120 4922 247 335 Sulfolobus 7 2780 3782 332 505 Bacillus 12 5264 7745 461 1070 Pseudomonas 5 5466 6212 381 405 Given the pattern of presence/absence of genes and a reference tree, we inferred gene expansion events with BayesTraits [33]. Families without protein expansions within the lineages were excluded from further analysis. We also removed IS and phage sequences, but not the corresponding cargo regions, because these elements constitute a large fraction of repeats in genomes [15], are known to be horizontally transferred and are in general quickly lost [34], [35]. Our main analysis includes the remaining 3190 families. These families have few members (0–3) in each genome (Figure 1) showing that expansions are rare at these narrow evolutionary scales. This is in agreement with high rates of change but low rates of fixation of changes in gene repertoires [36], [37]. The most frequent event found in our data was the gain of a paralog or xenolog by the largest genomes in the clade (Table S2), such as S. agalactiae NEM316 (20% of all expansions in Streptococcus). Around 50% of all gene expansions occur in the clade with largest genomes –Bradyrhizobium - with 35% occurring in the two largest genomes of that clade. Although families with recent expansions account for a small fraction of genomes this is in good agreement with an association between expansions of protein families and increased genome size. 10.1371/journal.pgen.1001284.g001 Figure 1 Histogram of the normalized size of gene families. For each family we compute the number of genes in the family and subtract it by the number of genomes containing at least one member of the family. Expansions of protein families arise most frequently by HGT Duplication processes in prokaryotes produce tandem, strictly identical copies of genes [8], [38], [39]. At the evolutionary distances considered in this work, this implies that paralogs arising in the lineages are co-localized because of the low rearrangement rates and are highly similar in sequence because of the low mutation rates. On the other hand, transferred genes are inserted almost randomly in genomes and show a large range of sequence similarity relative to the native homolog. Therefore, we disentangle IGD from HGT using sequence similarity and positional information (see Materials and Methods). We define a minimal similarity threshold for paralogs assuming that they evolve at rates close to the ones of the core genome. This threshold is therefore based on the distribution of similarities between pairs of orthologs of the core genome. More precisely, we define three thresholds corresponding to values where protein divergence exceeds that of 95%, 99% or 99.9% of the comparisons between orthologous core genes. Different thresholds produced qualitatively similar results and we use the 99% thresholds throughout this work (see Materials and Methods, Figure S4). The use of sequence similarity alone will lead to spurious classification of genes transferred between closely related genomes, because such genes are expected to be highly similar. We thus add the positional criterion considering that two genes are co-localized if no core gene separates them. Expansions producing highly similar co-localized homologs are thus assumed to be created by IGD, whereas the others are created by HGT. We find that the vast majority, between 88% and 98%, of the expansions of protein families are due to HGT (Figure 2). This is true even for large genomes, such as in Bradyrhizobiaceae, which have been proposed to increase in size by gene duplication. Hence, by default, expansions are much more likely to arise by transfer than by internal duplication. 10.1371/journal.pgen.1001284.g002 Figure 2 Relative contribution of horizontal gene transfer in protein family expansions. The majority (65%) of the expansions are assigned to HGT using either the criterion of sequence similarity or of co-localization, showing the robustness of the method. If one removes the co-localization criterion, thus assuming that expansions of highly similar proteins arise exclusively by IGD, we still find HGT as the major cause for expansions of protein families in 6 of the 8 clades. The two exceptions, Neisseria and Helicobacter, have a disproportionate fraction of highly similar homologs in different locations in the genome so that removing the co-localization criterion impacts significantly in the relative contribution of IGD (resp. accounting for 78% and 45%). It is well-established that such repeated elements often engage in gene conversion for antigenic variation both in Neisseria and in Helicobacter [40], [41]. Interestingly, both clades naturally transform conspecific DNA and are known to have extremely high rates of intra-species horizontal transfer [42], [43]. The high frequency of HGT within the species is liable to produce a large fraction of very similar xenologs scattered in the genome. This factor is presumably corrected by the use of the co-localization criterion. Importantly, the relative abundance of xenologs over paralogs in the largest genomes, the ones with more expansions, remains almost unchanged when removing the co-localization criterion. Xenologs persist longer than paralogs HGT might be less relevant if xenologs were lost at higher rates than paralogs. We calculated the age of introduction of each gene in its lineage. We found that xenologs have an average age of introduction that is twice that of paralogs (Mann-Whitney-Wilcoxon, p 1.5) [46]. As expected we found the highest dN and dS values in xenologs (dSHGT = 0.43, dSIGD = 0.26, p 7 Mb) is not yet available. On the other end of the spectrum, genomes much smaller than Helicobacter have few family expansions. All genomes were retrieved from Genbank, except for Neisseria lactamica that was provided by the Sanger Center. Gene family construction The procedure used for defining gene families is summarized as follows (Figure 6). First, we performed in each clade all-against-all BLASTP [65] comparisons, inter- and intra- genomically. We set the e-value cutoff to 10−7, required the hits to be at least 100 aa in length and the length of the BLASTP hit to span at least 70% of the length of the smallest protein. Also, similar to the procedure in [23] we evaluated the results with a BLASTP bit score threshold equal to 30% of the maximal bit score (i.e. a protein matching itself). Since we are interested in multi-gene families (i.e. at least one genome has two or more genes), we clustered pairwise BLASTP hits into multiple relationships using mclblastline from MCL [66], with parameters: --blast-m9 --blast-score = e --blast-sort = a --mcl-I = 2.0 --mcl-scheme = 7. After inspection of the results using inflation parameter values from 1.0 to 10.0 in increments of 0.2, we set it to 2.0. 10.1371/journal.pgen.1001284.g006 Figure 6 Protein family construction pipeline. Starting with a databank of proteins, we first performed all pairwise similarity searches using BLASTP. The hits were filtered regarding the length of the match (70% of the length of the query) and the bitscore (30% of the maximal bitscore calculated by aligning a protein against itself). To build the gene families we ran MCL blastline and then removed all singletons, IS and Phage. To build the core genome we used OrthoMCL along with a synteny filter based on M-GCAT Clusters. Finally, using presence/absence and phylogenetic information, we obtained the protein families with expansions Identification and removal of IS, prophages, and rDNA repeats Insertion Sequences (IS) were identified by BLASTP query of a custom IS database, including all Uniprot transposases (Touchon, in preparation) and removed from the analysis. We verified that for E. coli our databank contained all IS present in IS Finder [67]. We used Phage_finder to predict prophages [68]. IS are frequently pseudogenized. To remove small fragments of IS elements we searched for regions of homology with known IS in the genome using Repeatoire [69]. Core genome construction Since genomes in each clade have diverged recently, the orthologs should be highly similar. We enlisted OrthoMCL [70] to build the groups of orthologs inside of the gene families via all-against-all BLASTP comparisons in a clade of interest. OrthoMCL defines putative orthologous relationships between a pair of genomes as BLASTP reciprocal better/best hits. To limit false positives this list was further refined by combining information on the distribution of similarity of these putative orthologs with gene order conservation data (as in [71]). At these distances, gene order conservation is high [58], and positional information can significantly decrease classification errors [72]. Each ortholog pair was then tested for gene order conservation via the ordered list of putative orthologs between the two genomes (i.e. lateral transfer is discounted). Genes not satisfying the constraint are removed from the core genome. Finally, we removed all ortholog pairs less than 65% similar in sequence and differing by more than 30% in length. Naturally, one cannot exclude that some core genes have endured changes by homologous recombination with exogenous genetic material. Yet, gene conversion requires very high similarity and such effects will not affect significantly our study. On the other hand the rare xenologous replacements of core genes by very divergent sequences are expected to lead to loss of synteny and they are therefore discarded from our core genome. Phylogenetic analyses The reference phylogenetic tree of each clade was reconstructed from the concatenated alignments of genes comprising their core genome. Protein alignments were generated using MUSCLE [73] and then back-translated to DNA. Tree-Puzzle [74] was used to generate the matrix of distances by maximum likelihood with the HKY+Γ model and exact parameter estimates. The trees were then computed from these distance matrices using BIONJ [75]. We performed 100 bootstrap experiments on the concatenated sequences to assess the robustness of the topology. Ancestral state reconstruction We used the reference phylogeny and maximum likelihood (ML) optimization in the “Multistates” component of BayesTraits [33] to estimate ancestral states (0, gene not present, 1, family with no expansions, 2, expanded family, Table S3) for each gene family on all branches of the taxa. We enabled the covarion model for trait evolution, a variant of the continuous-time Markov model that allows for traits to vary their rate of evolution within and between branches. 100 optimization attempts were carried out to find the ML solution. For each node, a gene expansion was considered as present (state = 2) if its probability as estimated by BayesTraits was ≥0.5 (the analysis was repeated with p≥0.9 and yielded similar results). In our analysis we only included gene expansion events, not deletions. Calculating the age of expansions Using the BayesTraits ancestral state information [33], we calculated the age of expansions as the distance from the leaf node (genome currently containing the paralog or xenolog) to the ancestral node where the gene expansion event occurred (with respect to the root). Expansions appearing in genomes with very long terminal edges were discarded because they might include recent and ancient acquisitions. This includes the following genomes: E. fergusonii, P. fluorescens, S. equitans, S. agalactiae. Classifying the origin of gene family expansions The key point of our methodology is the disambiguation between paralogs and xenologs by combining two key pieces of evidence: (1) sequence similarity and (2) positional information (Figure 6). 1) Sequence similarity Genes deriving from a single gene in the last common ancestor of a clade are highly similar because mutations accumulate at a slow pace and the clades selected in this study have diverged recently. In this case the distribution of similarities within gene families of the core genome is narrowly distributed around 100% (Figure 7). As the last common ancestor between two genomes becomes more distant, e.g. comparisons between different species or genera, the range of similarity of genes in the core genome becomes larger. After a certain evolutionary distance, a significant number of duplications preceding the last common ancestor will be less divergent than some orthologs of the core genome because genes evolve at markedly different rates (Figure 7). This is why it is crucial to restrict the analysis to closely related genomes. To define a meaningful threshold of similarity we computed all pairwise sequence similarities between orthologous core genes. We then identified the 99% threshold of sequence similarity, i.e. the sequence similarity below which we found only 1% of the pairwise comparisons. Two genes were regarded as highly similar when their sequence similarity was above this threshold. If the paralogs evolve as the core genome then this threshold implicates that ∼1% of paralogs are spuriously assigned as xenologs. We re-did all the analysis with the 95% and 99.9% threshold for the two clades with highest and lowest fraction of xenologs over paralogs (Enterobacteriaceae and Neisseria). The results remain largely unchanged (Figure S4). A gene arising in a lineage after the last common ancestor that has a similarity lower than that of the 99% of comparisons between the genes of the core genome is regarded as having diverged excessively to result from a duplication process and we infer that it arose by horizontal gene transfer. 10.1371/journal.pgen.1001284.g007 Figure 7 Cumulative distribution function plot of protein similarity. Colored lines correspond to CDF plots of the similarity between orthologous proteins of the core genome for the comparison of E. coli K12 W3110 with genomes of increasing phylogenetic distances. The gray line corresponds to the similarity between homologous genes in the E. coli K12 W3110 genome. 2) Positional information The remaining recently acquired genes are more similar to at least one member of the gene family than 1% of the pairwise similarities within families of the core genome. These may have arisen by gene duplication or by transfer from genomes containing closely related copies of the gene, e.g. conspecifics. Many experimental and in silico studies indicate that genetic duplications in prokaryotes create tandem repeats [8], [38], [39]. Since rearrangements occur at low rates in prokaryotes, recent paralogs are expected to be co-localized. Hence, we regard co-localized highly similar expansions as paralogs and non-colocalized expansions as xenologs. Using the syntenic map provided by the core genome and the core genome genes as markers of gene order in the ancestral genome, if two gene expansions are located in the same syntenic block and not separated by core genome genes, then they are regarded as co-localized. For paralogs or xenologs that fall in positions corresponding to synteny breakpoints in a genome, if they are inside the same breakpoint region they are considered to be co-localized. In any other situation we consider that the position is not conserved. Although, strictly speaking, genetic duplications occur in tandem, we used this more flexible definition of co-localization because pairs of paralogs can be separated by insertions and deletions after duplication (these newly arising spacing genes are not part of the core genome). Supporting Information Figure S1 M-GCAT visualization of core genome regions in 41 enterobacterial genomes. The core genome is represented as the vertical colored polygons common to all of the 41 genomes listed. Inversions can be identified as inverted rectangles. Syntenic regions of the core genome are colored according to the rainbow spectrum and meant to give a general idea of genome rearrangement. For example, violet colored regions generally remain in the same positions in all genomes, although there are some exceptions. (1.12 MB PDF) Click here for additional data file. Figure S2 CDF plot of core genome similarity. CDF plot of percent similarity in the core genome and calculated thresholds for all clades except enterobacteria. (0.61 MB PDF) Click here for additional data file. Figure S3 Phylogenetic trees for core genomes using BIONJ and Maximum Likelihood (ML) distances for all clades. (1.52 MB PDF) Click here for additional data file. Figure S4 Effects of varying core genome similarity thresholds on HGT predictions. (0.10 MB PDF) Click here for additional data file. Table S1 Clade summary. List of genomes, accession numbers, and summary. (0.16 MB DOC) Click here for additional data file. Table S2 Phyletic patterns. Patterns of expansion in all 8 clades. Occurrences of synologs per genome. 0 =  no gene present in genome, 1 =  gene present w/o synologs, 2 =  synologs in genome. Phyletic Pattern is the sequence of states for all 41 genomes for each of the families with synologs. Count indicates the frequency each phyletic pattern of synologs in the families. (0.08 MB DOC) Click here for additional data file. Table S3 Interpretation of changes in ancestral states. The events marked with an asterisk can be interpreted as multiple successive events, e.g. 2 successive gene losses. The events in italics were ignored either because (a) there was no change from the ancestral node or (b) they cannot be parsimoniously explained by gene duplication. (0.03 MB DOC) Click here for additional data file.
                Bookmark

                Author and article information

                Contributors
                Role: Editor
                Journal
                PLoS One
                PLoS ONE
                plos
                plosone
                PLoS ONE
                Public Library of Science (San Francisco, USA )
                1932-6203
                2013
                13 September 2013
                : 8
                : 9
                : e70897
                Affiliations
                [1 ]Centre for Infection and Immunity, School of Medicine, Dentistry and Biomedical Sciences, Queen's University, Belfast, United Kingdom
                [2 ]Institute of Biochemistry, Biological Research Centre of the Hungarian Academy of Sciences, Szeged, Hungary
                Beijing Institute of Microbiology and Epidemiology, China
                Author notes

                Competing Interests: The authors have declared that no competing interests exist.

                Conceived and designed the experiments: AM. Performed the experiments: AM IN MM EB. Analyzed the data: AM IN SP. Contributed reagents/materials/analysis tools: AM IN SP. Wrote the paper: AM IN SP.

                Article
                PONE-D-13-17042
                10.1371/journal.pone.0070897
                3772855
                24058439
                aed3e22d-5d60-4590-9619-45fa31c87bb3
                Copyright @ 2013

                This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                History
                : 26 April 2013
                : 26 June 2013
                Page count
                Pages: 22
                Funding
                The work was supported by funding from the R & D Office of the Health and Personal Social Services Northern Ireland (grant number HSCNI 9.41; http://www.dhsspsni.gov.uk/ahp_research) and The Prostate Cancer Charity, United Kingdom (grant number 110831; http://www.prostate-cancer.org.uk/). IN: Hungarian National Office for Research and Technology Teller program OMFB-00441/2007, French-Hungarian Associated European Laboratory (LEA) SkinChroma OMFB-00272/2009. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
                Categories
                Research Article

                Uncategorized
                Uncategorized

                Comments

                Comment on this article