      VFDB 2019: a comparative pathogenomic platform with an interactive web interface

      Nucleic Acids Research
      Oxford University Press


          Abstract

          The virulence factor database (VFDB, http://www.mgc.ac.cn/VFs/) is devoted to providing the scientific community with a comprehensive warehouse and online platform for deciphering bacterial pathogenesis. The various combinations, organizations and expressions of virulence factors (VFs) are responsible for the diverse clinical symptoms of pathogen infections. Currently, whole-genome sequencing is widely used to decode potential novel or variant pathogens both in emergent outbreaks and in routine clinical practice. However, the efficient characterization of pathogenomic compositions remains a challenge for microbiologists or physicians with limited bioinformatics skills. Therefore, we introduced to VFDB an integrated and automatic pipeline, VFanalyzer, to systematically identify known/potential VFs in complete/draft bacterial genomes. VFanalyzer first constructs orthologous groups within the query genome and preanalyzed reference genomes from VFDB to avoid potential false positives due to paralogs. Then, it conducts iterative and exhaustive sequence similarity searches among the hierarchical prebuilt datasets of VFDB to accurately identify potential untypical/strain-specific VFs. Finally, via a context-based data refinement process for VFs encoded by gene clusters, VFanalyzer can achieve relatively high specificity and sensitivity without manual curation. In addition, a thoroughly optimized interactive web interface is introduced to present VFanalyzer reports in comparative pathogenomic style for easy online analysis.

          Most cited references (11)

          VirulentPred: a SVM based prediction method for virulent proteins in bacterial pathogens

          Background: Prediction of bacterial virulent protein sequences has implications for identification and characterization of novel virulence-associated factors, finding novel drug/vaccine targets against proteins indispensable to pathogenicity, and understanding the complex virulence mechanism in pathogens.

          Results: In the present study we propose a bacterial virulent protein prediction method based on a bi-layer cascade Support Vector Machine (SVM). The first-layer SVM classifiers were trained and optimized with different individual protein sequence features such as amino acid composition, dipeptide composition (occurrences of the possible pairs of the ith and (i+1)th amino acid residues), higher-order dipeptide composition (pairs of the ith and (i+2)th residues) and Position Specific Iterated BLAST (PSI-BLAST) generated Position Specific Scoring Matrices (PSSM). In addition, a similarity-search based module was developed using a dataset of virulent and non-virulent proteins as the BLAST database. Five-fold cross-validation was used to evaluate the various prediction strategies in this study. The results from the first layer (SVM scores and PSI-BLAST result) were cascaded to the second-layer SVM classifier to train and generate the final classifier. The cascade SVM classifier achieved an accuracy of 81.8%, covering 86% of the area in the Receiver Operating Characteristic (ROC) plot, better than that of any of the layer-one SVM classifiers based on single or multiple sequence features.

          Conclusion: VirulentPred is an SVM-based method to predict bacterial virulent protein sequences, which can be used to screen virulent proteins in proteomes. Together with experimentally verified virulent proteins, several putative, non-annotated and hypothetical protein sequences have been predicted to be high-scoring virulent proteins by the prediction method. VirulentPred is available as a freely accessible World Wide Web server – VirulentPred, at .
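          The composition features named in the abstract above can be illustrated with a short sketch. It is not the VirulentPred implementation: the function names, the `gap` parameter and the handling of non-standard residues (they are skipped) are assumptions made for this example. With `gap=0` the pairs are residues i and i+1; with `gap=1` they are residues i and i+2.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq):
    """Fraction of each standard residue in seq (20 features)."""
    counts = Counter(c for c in seq.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    return {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}


def dipeptide_composition(seq, gap=0):
    """Fraction of each ordered residue pair (i, i+1+gap) in seq (400 features).

    Pairs containing a non-standard residue are skipped (an assumption).
    """
    seq = seq.upper()
    step = gap + 1
    pairs = [seq[i] + seq[i + step] for i in range(len(seq) - step)]
    pairs = [p for p in pairs if p[0] in AMINO_ACIDS and p[1] in AMINO_ACIDS]
    counts = Counter(pairs)
    total = len(pairs)
    return {a + b: (counts[a + b] / total if total else 0.0)
            for a, b in product(AMINO_ACIDS, repeat=2)}


if __name__ == "__main__":
    toy = "MKKLS"  # made-up peptide, for illustration only
    print(amino_acid_composition(toy)["K"])
    print(dipeptide_composition(toy)["KK"])
    print(dipeptide_composition(toy, gap=1)["MK"])
```

          Each function returns a fixed-length feature dictionary, so the values can be concatenated in a stable order before being passed to whatever classifier is used.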

            Sequence-Based Prediction of Type III Secreted Proteins

            Introduction

            Many Gram-negative bacteria with symbiotic or parasitic lifestyles modulate their environment, the eukaryotic host cell, by the secretion of bacterial proteins into the host cell through the type III secretion system (TTSS) [1]. The unique role of type III mediated transport for establishing as well as maintaining infection makes it a key mechanism for bacterial pathogenesis [2]–[4]. While much progress on resolving the structure of the TTSS itself has been made recently [5], the identity and function of only a few effector proteins are well understood so far. These include different virulence factors that interact with cell signaling pathways to suppress the immune response by inducing apoptosis in macrophages, such as the Yersinia effector YopJ or the Salmonella effector SipB [6],[7]. Other known effectors manipulate the cytoskeleton by actin rearrangements, as described for the Salmonella effector SipA [8]. The arsenal of known effectors varies widely between different bacterial species due to adaptation to different hosts and different survival strategies [9], and even between different strains of the same organism, as shown for Pseudomonas syringae [10]. Experimental identification of novel effectors relies on translocation assays using fusion proteins of a putative effector with a reporter gene [11]–[14] or on detection of effectors in the culture supernatant [11]. In many of these studies, prior information is derived computationally from the genome or from protein sequences to create candidate lists of putative effectors before testing them in an appropriate assay. Homology to known effector proteins has been used in a screen for effectors in the pathogenic Escherichia coli strain O157 [11]. Chromosomal co-localization of putative effectors with TTSS-related chaperones has been used in Bordetella bronchiseptica [15]. Common transcriptional regulation with elements of the TTSS has been exploited to detect putative effectors in P. syringae [13],[16].
In the same organism, an unusual amino acid composition in the N-termini of effectors has been identified as a characteristic of effector proteins and used for their identification [16]–[18]. In all these approaches, the computational analysis successfully limited the amount of candidates which had to be included in experimental analyses in order to find novel effectors. However, none of these methods is either exhaustive or generally applicable. Homology based approaches can only detect effectors which are members of known effector families, and these are mostly specific for certain well-known bacterial species. Approaches using transcriptional co-regulation need knowledge about a TTSS effector specific promoter which has not yet been described for most bacteria possessing a TTSS. The unusual amino acid composition in the effector N-termini has to date only been described and exploited in screens in P. syringae. Chromosomal co-localization is only applicable if effectors and TTSS related proteins or chaperones are clustered in genomic proximity as described for the pathogenicity islands in Salmonella [19]. However, these pathogenicity islands are absent in other bacteria known to harbour a TTSS such as the Chlamydiae, where the genes encoding known effectors are scattered around the genome [20],[21]. In order to create a general method for the prediction of type III secreted proteins, the most straightforward way would be the identification of a general molecular signal which leads to specific recognition of effector proteins by the TTSS. The molecular structure of such a secretion signal is, however, so far unknown. The binding of specific chaperons has been shown to be necessary in some cases [22] but does not seem to be a general prerequisite. Several studies indicate a signal in the N-terminus either encoded in the underlying mRNA [23],[24] or in the peptide [12],[25],[26]. 
Subtil et al., for example, successfully screened for TTSS effectors using fusion proteins consisting of a chlamydial N-terminus and a reporter gene in a heterologous Shigella flexneri assay [12]. This experiment showed that the first 15 amino acids are sufficient for the secretion of several chlamydial effectors. In this work we demonstrate that information derived from N-terminal peptides is universally applicable to successfully predict type III secreted proteins. We have implemented EffectiveT3, the first general prediction software for type III effector proteins. This software is based on a machine learning approach and can be applied to single proteins as well as complete proteomes. We investigate the molecular shape (i.e., length, position, composition) of the signal captured by the EffectiveT3 software and demonstrate that the signal is taxonomically universal. We applied the EffectiveT3 software to 739 prokaryotic proteomes and discuss the sizes of predicted secretomes. Results/Discussion Common features of known effector proteins To comprehensively investigate the nature of the TTSS signal, we compiled a database of known effector proteins from members of the phylum Chlamydiae and the genera Escherichia, Yersinia and Pseudomonas by an exhaustive mining of literature. These “animal pathogen” and “plant symbiont” sets consist exclusively of proteins with individual experimental evidence for type III mediated transport and comprise 100 proteins including 48 effectors from animal pathogens/symbionts and 52 effectors from plant symbionts (Table S1). 39 of them can be clustered by sequence similarity into 15 distinct orthologous groups (see Table S2). These orthologous groups, however, turned out to be restricted to their respective taxon. Their members have no counterparts with significant homology over the major part of their sequences in other organisms included in this study. 
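The grouping of effectors into orthologous groups follows a rule stated in the paper's Methods and in Table S2: the alignment score of a pair is divided by the self-score, and sequences are grouped iteratively when this Sratio is at least 0.15. The sketch below illustrates that rule under several assumptions: a toy Smith-Waterman score with fixed match, mismatch and gap values stands in for the JAligner scoring actually used, the larger of the two self-scores is taken as the denominator, and the iterative grouping is read as single-linkage clustering. All function names are hypothetical.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score with a toy scoring scheme (an assumption)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def s_ratio(a, b):
    """Alignment score divided by the larger self-score (choice is an assumption)."""
    self_score = max(smith_waterman_score(a, a), smith_waterman_score(b, b))
    return smith_waterman_score(a, b) / self_score if self_score else 0.0


def group_by_s_ratio(seqs, threshold=0.15):
    """Single-linkage grouping: join two sequences when Sratio >= threshold."""
    parent = list(range(len(seqs)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if s_ratio(seqs[i], seqs[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(seqs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    # Made-up sequences, for illustration only.
    print(group_by_s_ratio(["MKTLLSSPTA", "MKTLLSSPTG", "WWWWWWWWWW"]))
```

The function returns lists of sequence indices, one list per group. Because membership requires similarity to only one existing member, distant sequences can end up in the same group through intermediates.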
To investigate whether predicted functional interactions based on genomic context methods [27] could be used for the prediction of TTSS effectors, we analyzed all known effectors using the STRING database [28]. A few cases of conserved chromosomal neighbourhood of effectors with structural TTSS proteins or chaperones could be observed, whereas most effectors do not co-evolve with the TTSS (Table S3). The genomic neighbourhood of known effectors has been further examined by statistical analysis of all co-localized proteins. Components of the TTSS are significantly enriched in the proximity of effectors (Table S4). The highest significance of this enrichment has been observed within the range of 30 proteins up- and downstream. Within these neighbours, 7 structural TTSS proteins show individual enrichment of statistical significance (Table S5). However, particularly in genomes encoding the TTSS on the chromosome as e.g. Chlamydiae, the majority of effectors cannot be found in genomic proximity to components of the TTSS (Table S6). Thus we cannot derive a general co-evolution rule for all effectors, which limits the predictive power of genomic context methods significantly. However, the observed co-evolution of certain effectors with each other and the co-localization of several effectors with TTSS components and chaperones make this methodology valuable for situations if such effectors or chaperones are already known or if the TTSS is encoded on a plasmid or on a genomic island. In a next step we analyzed the N-terminal amino acids of known TTSS effectors in greater detail. Within their N-terminal peptides, the effectors did not show any conserved residues in several multiple sequence alignments performed and analyzed (see Figure S1 for an example). The absence of conserved positions indicative of a common sequence motif or domain signature, which could serve as a signal, demonstrated that a conserved binding domain can be excluded as a general TTSS signal. 
A secretion signal could also be encoded in the secondary structure of the N-terminus. We employed secondary structure predictions and counted the structural features (coil, α-helix, β-sheet) at each residue within the first 25 amino acids. In the known TTSS effectors, 51% coil, 39% α-helix and 10% β-sheet have been predicted. In randomly selected proteins (not known to be secreted via a TTSS) we predicted 39% coil, 45% α-helix and 16% β-sheet, which indicates that coiled regions are enriched in the N-termini of TTSS effectors. These findings fit well with data from P. syringae, a well-studied plant pathogen, for which an unusual amino acid composition in the N-termini of effectors has been reported [16]–[18],[29]. Therefore we tested, whether this unusual amino acid composition is a general feature of effector proteins. A Mann-Whitney test on amino acid frequencies derived from both the whole sequences and the first 25 residues of the N-termini from the effector sets and randomly selected proteins revealed significant enrichments and depletions of certain amino acids in sequences from animal pathogens and plant symbionts, respectively (Figure 1). This effect is particularly strong in the N-terminal end and therefore, this composition bias could reflect an exploitable signal of TTSS mediated transport. The most significant enrichment in the N-termini of effectors of animal pathogens and plant symbionts is that of serine. Threonine and proline are significantly enriched in the effectors of animal pathogens, and leucine is depleted in both animal and plant effector proteins. Notably, the enrichment of proline could explain the enrichment of coiled regions in the N-termini as this amino acid is known to be less frequent in α-helices and β-sheets. Interestingly, these experiments revealed both commonalities and differences between the N-terminus of effector proteins from plant and animal pathogens, respectively. 
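The comparison described above (per-protein residue frequencies in the first 25 positions, compared between effectors and randomly selected proteins with a Mann-Whitney test) can be sketched as follows. This is an illustration, not the authors' code: the function names are hypothetical, the sequences in the usage example are invented, and only the U statistic is computed. Turning U into a p-value is left to a statistics library.

```python
def n_terminal_frequency(seq, residue, n=25):
    """Fraction of `residue` among the first n residues of seq."""
    window = seq[:n].upper()
    return window.count(residue) / len(window) if window else 0.0


def mann_whitney_u(xs, ys):
    """U statistic of xs against ys.

    Counts the (x, y) pairs with x > y; tied pairs count 0.5.
    U ranges from 0 to len(xs) * len(ys).
    """
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


if __name__ == "__main__":
    # Invented toy sequences; real input would be the effector and control sets.
    effectors = ["MSSTSPSLSSNTSPTSSQSRSSPTA", "MPSISTSSTNSPSSLTSSGSTPSSA"]
    controls = ["MKLLALAELLKEALDLAEKLLEAAG", "MAELLKKLAEDLLEALAKAGVDLLE"]
    serine_eff = [n_terminal_frequency(s, "S") for s in effectors]
    serine_ctl = [n_terminal_frequency(s, "S") for s in controls]
    print(mann_whitney_u(serine_eff, serine_ctl))
```

A value of U close to `len(xs) * len(ys)` means the first group tends to have higher frequencies (enrichment), and a value close to 0 indicates depletion.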
Figure 1 (10.1371/journal.ppat.1000376.g001). Enrichment of amino acids in effector N-termini. Amino acids that are significantly enriched or depleted in the first 25 residues of effectors from the animal pathogen effector set and from the plant symbiont effector set (p-Value [...]

[...] 1) are enriched in effector-like sequences in the N-termini. Low Z-Scores indicate the presence of ORFans, which show similar characteristics to type III effectors over their whole length (Table S11).

Figure 4 (10.1371/journal.ppat.1000376.g004). Overview of EffectiveT3 predictions in complete genomes from Gram-positive bacteria and archaea. The figure shows the percentage of positive predictions in proteomes from Gram-positive bacteria and archaea, respectively, depending on the G+C content of the genomes. Linear fits are shown by trend lines in the colours of the respective data sets; attached are the coefficients of determination R² of each fit. The individual results for all proteomes can be found in Table S11.

In Gram-negative bacteria, the correlation between the number of positives and the genomic G+C content is much weaker (R² ≈ 0.06) than in Gram-positives (Figures 4 and 5). On top of the expected false positive rate, most proteomes with a TTSS encode more putative effectors than their relatives without a TTSS. The lack of a clear difference between Gram-negatives with and without a TTSS may be explained by the noise caused by misannotations, which seem to be present in all selected genomes (data not shown). Additionally, putative type III effectors may not be a unique feature of species encoding a TTSS but could be ubiquitous in a broad range of phylogenetically diverse microbes. This finding would be surprising, but could be explained by the absence of evolutionary pressure against secretion-competent N-termini in microorganisms without a TTSS.
Additionally, effector proteins might be subject to horizontal gene transfer into genomes without a TTSS, where they neo-functionalize but keep their N-termini.

Figure 5 (10.1371/journal.ppat.1000376.g005). Overview of EffectiveT3 predictions in complete genomes from Gram-negative bacteria with and without TTSS. The figure shows the percentage of positive predictions in proteomes from Gram-negative bacteria with and without TTSS, depending on the G+C content of the genomes. The plot has been scaled like Figure 4 to facilitate comparison. Linear fits are shown by trend lines in the colours of the respective data sets; attached are the coefficients of determination R² of each fit. The individual results for all proteomes can be found in Table S11.

Conclusion

The TTSS is a key virulence factor in many important human pathogens, such as Salmonella sp., Yersinia sp., Chlamydiae and E. coli. However, the prediction of TTSS effector proteins has so far been possible only on a small taxonomic scale, impeding the study of this important group of virulence factors in newly sequenced genomes of organisms without well-studied close relatives. In this study we describe the identification of taxonomically universal features of TTSS effector proteins, which formed the basis for the development of the program EffectiveT3, the first universally applicable in silico prediction method for TTSS transported proteins. The core of our in silico prediction method consists of a machine learning approach, which behaves like a black box in the sense that it does not imitate the unknown biological mechanism itself but models regularities in the N-terminal peptides of TTSS effectors. Since the training set shared no common feature besides TTSS mediated transport, EffectiveT3 must capture the sequence-related parts of the biological signal. In contrast, it has not been possible to learn on equally sized, randomly selected sequences using the same machine learning protocol.
Thus the predictive performance cannot result from a selection bias introduced by small training sets. EffectiveT3 performs far above random in the cross-validation as well as on data derived from organisms which were not present in the training set. A certain degree of generality of the TTSS substrate recognition process was already suggested by heterologous secretion assays [12]. Our computational model demonstrates that the signal is indeed highly conserved over a broad taxonomic range, facilitating the prediction of plant symbiont effectors using information derived from animal pathogens (and vice versa). This taxonomic universality of the TTSS secretion signal implies a common mechanism of TTSS substrate recognition across phylogenetically diverse bacterial groups. The great value of the EffectiveT3 method is its independence from sequence similarity to known effectors and from organism-specific a priori knowledge. It is therefore suited to application on newly sequenced genomes from bacteria with a Gram-negative type cell wall and to the detection of novel effector families, which could lead to the discovery of so far unrecognized virulence factors and thus improve our understanding of the ways in which bacterial pathogens manipulate host cells. Since the procedure yields a substantial fraction of false positive predictions and is intrinsically sensitive to misannotations such as wrongly annotated gene starts and ORFans, the current method should be complemented by specific pre- and postprocessing steps. Before applying EffectiveT3, the gene annotations of the analyzed proteins should be verified to remove ORFans and ensure correct translational start sites. An additional protocol to filter and rank the positive predictions by reliability might include the exclusion of already annotated genes, house-keeping genes and proteins with a signal for other transport routes such as the SecA pathway.
Particularly in genomes which encode TTSS components on plasmids or genomic islands, the genomic proximity of TTSS components might be enriched in effectors and should be analyzed additionally. The most promising improvement of our computational model would be the consideration of the transcriptional control of effector proteins [36]. It can be expected that genome-wide transcriptional data will become available in the near future for a sufficient number of genomes having known type III effectors. The EffectiveT3 predictions can be accessed online at http://www.chlamydiaedb.org. The software is freely available from the authors upon request.

Materials and Methods

Data sets

The known type III effector proteins have been collected manually from the literature. A protein has been included if there is at least one piece of direct evidence for TTSS mediated transport resulting from a single experiment. Not included are proteins that are part of the TTSS needle complex (although some of them are transported by the TTSS) and data from large-scale screens. By this procedure, we collected an animal pathogen set of 48 proteins comprising the taxa Chlamydia (17 sequences), Salmonella (9 sequences), Yersinia (15 sequences) and Escherichia (7 sequences). A representation of this set with only one member of each orthologous group has been created separately. The sequences were downloaded from SWISSPROT/UNIPROT [37] (version as downloaded on 07/30/2008) or, if not contained there, from RefSeq [38] (version as downloaded on 07/30/2008). We retrieved the plant symbiont set consisting of 52 known Pseudomonas effector proteins from the Pseudomonas syringae Genome Resources database [39] (Hop virulence protein/gene database, downloaded on 07/30/2008). A complete list of the effector sequences used is given in Table S1. All effectors have been examined for correctness of translational start sites by manual inspection of multiple sequence alignments with their homologs.
Negative training sets of non-effectors have been created by randomly choosing proteins from the organisms represented in the animal pathogen and plant symbiont sets devoid of the known effectors. Each negative set is twice as large as its corresponding positive set. This procedure has been repeated five times in order to enable investigations on the influence of the negative set on the prediction. Protein sequences from completely sequenced genomes of Yersinia, Escherichia, Salmonella, Pseudomonas, Chlamydia species as well as of gram(+) Bacteria, Archaea and Gammaproteobacteria were downloaded from RefSeq (version as downloaded on 07/30/2008) [40]. The data sets were classified into organism with and without TTSS by manual search in the literature for the case of gram(−) bacteria or generally classified as “without TTSS” in the case of gram(+) bacteria and archaea. A complete list of organisms used is given in the Table S11. A list of proteins building the TTSS system has been obtained by full-text searches against the SIMAP [41] databases using the gene-names of the TTSS compounds as given by KEGG [34]. Grouping of training sets by homology An all-against-all comparison of the full length-sequences using the Smith-Waterman algorithm [42] as implemented in the Jaligner package was performed [43]. For each pair, a similarity score Sratio by dividing the alignment score by the selfscore is computed and sequences are iteratively grouped if they show a Sratio value greater or equal 0.15. This measure is similar to the measure used by Lerat et al. in a study of genome repertoires in bacteria [44] and has been adjusted to maximal sensitivity in the detection of putative orthologs. Secondary structure prediction To predict secondary structure features we used the PSIpred-software [45]. The prediction has been applied to the whole sequences. PSIpred can be applied using alignments to conserved sequences as extrinsic information using PSI-BLAST [46]. 
For this purpose, we performed PSI-BLAST searches against SWISSPROT/UNIPROT. For the N-terminal ends of the effectors, we did not obtain enough alignments to improve the secondary structure prediction at these positions. As a consequence, we only used the ab initio prediction without alignment information. We then counted the fraction of each predicted class in the N-termini as an input feature for the prediction pipeline.

Multiple alignments of N-termini

Multiple alignments have been created using two different methods: ClustalW (Version 2.0.5) [47] and Muscle (Version 3.7) [48], with standard parameters. We randomly chose ten sequences from the sets of known effectors to create multiple alignments and aligned their first 10, 20 and 30 residues. This procedure has been repeated 20 times. We manually checked the alignments for conserved regions similar to a multiple alignment containing a certain domain signature. Example alignments are given in Figure S1.

Statistical enrichment analyses

Enrichments and depletions of amino acid properties (frequency, frequency of its representations in a reduced alphabet, frequency of secondary structure properties) have been assessed using a one-sided Mann-Whitney test with p [...]

[...] 0.1 over the whole sequence or more than 0.3 in the area of the signal, one of them was discarded from the training set. This has been done to avoid learning protein families instead of the signal. Sensitivity has been computed as TP/(TP+FN) and Selectivity as TN/(TN+FP), with TP = number of true positive predictions, FN = number of false negative predictions, TN = number of true negative predictions, FP = number of false positive predictions. Receiver Operating Characteristic statistics to determine the AUC value have been created using the WEKA toolbox. Precision and Recall are computed separately for both classes, whereas the AUC describes the overall performance of the classifier. The classification algorithms employed are listed in Table 3.
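The two performance measures defined above, Sensitivity = TP/(TP+FN) and Selectivity = TN/(TN+FP), can be worked through with a minimal sketch. The helper names, the example probabilities and the 0.95 cut-off used to turn class probabilities into calls are assumptions for illustration. The paper reports user-selectable thresholds, and this code is not taken from EffectiveT3.

```python
def call_secreted(probabilities, threshold=0.95):
    """Turn class probabilities into boolean 'secreted' calls (threshold is an assumption)."""
    return [p > threshold for p in probabilities]


def confusion_counts(y_true, y_pred):
    """Return (TP, FN, TN, FP) for boolean labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    return tp, fn, tn, fp


def sensitivity(tp, fn):
    """TP / (TP + FN); 0.0 when there are no positive examples."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def selectivity(tn, fp):
    """TN / (TN + FP); 0.0 when there are no negative examples."""
    return tn / (tn + fp) if (tn + fp) else 0.0


if __name__ == "__main__":
    # Invented labels and probabilities, for illustration only.
    truth = [True, True, True, False, False, False, False]
    probs = [0.99, 0.97, 0.60, 0.10, 0.20, 0.30, 0.96]
    tp, fn, tn, fp = confusion_counts(truth, call_secreted(probs))
    print(sensitivity(tp, fn), selectivity(tn, fp))
```

Raising the threshold trades sensitivity for selectivity, which is the adjustment the probability threshold is meant to allow.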
Exploration of optimal position and length To determine the optimal position and length of the signal we applied a sliding window approach varying the start and length of the sequence used for the learning and testing procedure. At each position, the whole procedure of feature selection, removal of similar sequences, training and cross-validation has been repeated. For the position exploration, we used a window of the length 15 which we moved in steps of five residues. The length exploration started with a window of the first ten residues which was elongated by five residues in each round. If a sequence was too short for the range of coordinates in a certain step of this procedure, it has been discarded from the data set. Since we found that the choice of the negative set does not significantly influence the prediction, we used only one negative set in this analysis. Signal robustness The robustness of the signal has been assessed by measuring the fractions of positively predicted instances from the training set after introducing a certain amount of amino acid exchanges in the first 25 residues. We only used these sequences, which are predicted as true positives by the final classification algorithm (full training set, Naïve Bayes algorithm with selective settings [probability for class “secreted” >0.95 using the Naïve Bayesian Classifier]). We mutated the N-terminal sequences (first 25 residues) by introducing point mutations at random positions into the underlying DNA sequences (T,A,C,G exchanged with equal probability of 1/4) which did not result in stop codons but altered the amino acid sequence. After translating the mutated sequence, we measured the fraction of positively predicted effectors after one, five and ten consecutive mutations. In a second strategy we substituted randomly selected amino acids according to their importance for the TTSS signal peptide. 
Residues which did not belong to the group of depleted amino acids (leucine, glutamic acid, aspartic acid and alanine) were replaced by a randomly selected member of this group of depleted amino acids. Residues which did belong to the group of enriched amino acids (threonine, serine and proline) were replaced by randomly selected amino acids which did not belong to this group of enriched amino acids (the substitution probabilities for the non-enriched amino acids have been derived from their frequency within the complete proteins without the N-terminal ends). The effect of frame shift mutations on the signal We have used a data set given by Ramamurthi et al. [56] of three Yersinia effector proteins with three frame shift mutants for each. We retrained our classifier using the first 15 amino acids instead of the first 25, since only the first 15 residues of the mutants are given in the paper. Simulation of frame shifts has been done by shifting the DNA by one (+1) and two (+2) positions. In order to get a sufficient amount of sequences with sufficient length, appearing stop codons have been replaced by methionine. We used only these effectors, which show a positive prediction with restrictive parameters (probability for class “secreted” >0.95 as reported by the Naïve Bayesian Classifier). As control, we used randomly selected sequences from the same organisms which are covered by the positive set and used only these sequences, which were negatively predicted (probability not secreted >0.95 as reported by the Naïve Bayesian Classifier). Signals conserved after frame shift were detected with the same settings as in the selection procedure. Taxonomic universality of the signal Notably, a conclusion about the signal's generality cannot be deduced by the fact that the classifier performs well in the cross-validation procedure, since the algorithm might detect independent features for each taxon in this procedure. 
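The frame-shift simulation described in the preceding section (shifting the DNA by +1 and +2 positions, replacing stop codons by methionine, and keeping the first 15 residues) can be sketched as below. The sketch assumes that a shift of +k means dropping the first k nucleotides and that the standard genetic code applies. Function names are hypothetical, and codons containing ambiguous bases are rendered as "X".

```python
BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna, stop_replacement="M"):
    """Translate DNA codon by codon, replacing stop codons by methionine."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")
        protein.append(stop_replacement if aa == "*" else aa)
    return "".join(protein)


def frame_shifted_peptides(dna, n_residues=15):
    """N-terminal peptides for the original frame (0) and the +1 and +2 frames."""
    return {shift: translate(dna[shift:])[:n_residues] for shift in (0, 1, 2)}


if __name__ == "__main__":
    # Made-up DNA fragment, for illustration only.
    print(frame_shifted_peptides("ATGAAATAAGGC"))
```

Each of the three peptides could then be scored by a classifier to check whether a positive prediction survives the frame shift.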
In order to test the universality of the signal, we excluded each taxon (Yersinia, Salmonella, Escherichia, Chlamydia, Pseudomonas) from the training and feature-selection procedure and tested the classifiers performance with this taxon as separate test set. For both sets, negative sets twice as large are randomly created from these organisms, which are also in the respective positive set. The values for the AUC have been computed using the WEKA-toolbox. Final training of the classifier for the prediction of secretomes The final classifier has been obtained using both sets of known effectors and a negative set which was twice as large as the positive set. We used the Naïve Bayes algorithm as it showed the best overall performance in the cross-validation procedure. Again, we excluded similar N-termini and used the first 25 amino acids as primary input. The sequence data of the proteomes has not been pre-filtered or further processed for the prediction of effectors in complete genomes. To investigate the influence of the amino acid frequencies within each proteome, the prediction of effectors has been also performed in pseudo-proteomes, for which all protein sequences have been denaturised by random shuffling. The shuffling process has altered only the order of amino acids within the proteins but not their overall (genome-wide) frequency. Implementation of the effectiveT3 software The EffectiveT3 software is based on the WEKA toolbox and implemented purely in the Java™ programming language. The probability threshold for class “secreted” using the Naïve Bayesian Classifier can be selected by the user in order to adjust the selectivity and sensitivity of the predictions. We offer a web-interface for own predictions at http://www.chlamydiaedb.org. Application of effectiveT3 to complete archaeal and bacterial proteomes Complete genome and proteome data of prokaryotic genomes has been downloaded from the KEGG database [34] (release 2009/01/19). 
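The pseudo-proteome control described above, in which each protein is shuffled so that residue order changes while composition is preserved, can be sketched in a few lines. The function names and the fixed random seed are assumptions for illustration, and the example sequences are invented.

```python
import random


def shuffle_proteome(proteins, seed=0):
    """Return a pseudo-proteome: every sequence shuffled independently.

    Only the order of residues within each protein changes, so per-protein
    and proteome-wide amino acid frequencies stay the same.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility (an assumption)
    shuffled = []
    for seq in proteins:
        residues = list(seq)
        rng.shuffle(residues)
        shuffled.append("".join(residues))
    return shuffled


def same_composition(a, b):
    """True if two sequences contain exactly the same residues, in any order."""
    return sorted(a) == sorted(b)


if __name__ == "__main__":
    proteome = ["MKTLLSSPTASNQ", "MAELLKKLAEDLL"]  # made-up sequences
    for original, pseudo in zip(proteome, shuffle_proteome(proteome)):
        print(original, pseudo, same_composition(original, pseudo))
```

Comparing the number of positive predictions in the real and the shuffled proteome separates the contribution of residue order from that of overall composition.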
Components of the TTSS have been identified by their association with the KEGG Orthologous Groups (KO) belonging to the TTSS reference pathway KO03070 (K03219..K03230). Genomes in which at least 9 of these 12 KO are present have been considered as genomes with TTSS. Genomes in which fewer than 6 of these 12 KO are present have been considered as genomes without TTSS. All genomes in which between 6 and 8 of these 12 KO are present have been excluded from this analysis to avoid uncertainty. Additionally, all bacterial genomes for which no information on cell wall type (Gram-positive vs. Gram-negative) was available at the NCBI Entrez Genome Project Organism Info database [38] have been excluded from this analysis. For the remaining 739 proteomes, EffectiveT3 predictions have been calculated using a selective parameter setting (probability for class “secreted” >0.99 using the Naïve Bayesian Classifier). To estimate the enrichment of TTSS effector-like sequences in the N-termini of the proteomes, a genome-wide Z-Score is calculated for every proteome: Z = (N-A)/SD, where N denotes the number of positives in the N-termini of the real proteome. A and SD are derived from 50 repetitions predicting positives in randomly chosen segments of 25 aa length (one segment per protein), where A corresponds to the average number of positives in the 50 runs and SD to their standard deviation.

Supporting Information

Figure S1 Example alignment of N-termini. The first 30 residues of non-homologous effector proteins have been aligned using ClustalX with default parameters. (4.31 MB TIF)

Figure S2 Example alignments between effector and non-effector orthologs. To investigate the evolutionary acquisition of the signal peptide, a pairwise sequence alignment study counting individual elongations and truncations between effectors and non-effector orthologs has been performed. This figure shows examples of these alignments.
A) demonstrates elongation and B) truncation of effector proteins (upper row) aligned with known non-effector proteins (lower row). (1.31 MB TIF)

Figure S3 Robustness of the TTSS secretion signal against point mutations. The diagram depicts the percentage of positively predicted TTSS signals after accumulation of point mutations. The non-targeted mutation strategy exchanged residues cumulatively at random. The targeted mutation strategy preferentially exchanged those features that we found to have the strongest influence on the signal. For both experiments, all positively predicted proteins from the animal pathogen and plant symbiont training sets have been used. (0.09 MB TIF)

Table S1 Effector and TTSS sequences used in this study. Effector proteins are listed first, then the sequences of the TTSS and a few examples of TTSS-related chaperones. The different sets are denoted as follows: A = animal pathogen set, P = plant symbiont set, T = type III secretion system, C = TTSS-related chaperone. For each sequence, the first 25 N-terminal amino acids are given. (0.20 MB DOC)

Table S2 Orthologous groups of effector proteins. This table comprises effector proteins with individual experimental evidence for type III mediated transport which can be clustered into orthologous groups (clustered by homology and manual inspection). A sequence is added to a cluster if it has an Sratio >= 0.15 to at least one other cluster member. Sratio is computed as alignment score/self-score. (0.08 MB DOC)

Table S3 Groups of co-evolving effector and TTSS proteins and examples of co-localized effector proteins and chaperones based on the STRING database. For each group of co-evolving effector and TTSS proteins, gene names of the members are given.
The right column indicates whether the orthologous group comprises effectors, TTSS proteins or TTSS-related chaperones. A gene is added to a cluster if the score of a genomic context method to another member derived from STRING exceeds 0.5. In the last section, examples of co-localized effectors and chaperones are listed. (0.05 MB DOC)

Table S4 Number of genomic neighbours of known effectors, number of non-neighbours and their association to the TTSS. For all known effectors from Table S1, genomic neighbours have been determined for a certain distance upstream and downstream on the chromosome or plasmid. These neighbours and the remaining, non-neighbouring proteins of the genomes have been distinguished by their association to the TTSS. Components of the TTSS are enriched in the neighbourhood of effectors. The statistical significance of this enrichment has been determined using the t-test. The most significant enrichment of TTSS components in the genomic neighbourhood of effectors can be observed within the range of 30 neighbours up- and downstream (marked in red). (0.04 MB DOC)

Table S5 Enrichment of KEGG orthologous groups within the genomic neighbourhood of known effectors. This table lists KEGG orthologous groups (KO) which are significantly enriched (Bonferroni-corrected t-test p-value <0.05) within 30 neighbours up- and downstream of known effectors. (0.03 MB DOC)

Table S6 Known effectors and their genomic neighbourhood to TTSS components. The genomic neighbourhood (30 genes up- and downstream) to TTSS components has been evaluated for all known effectors, except those of Yersinia pestis KIM, due to the absence of the plasmid pCD1 from the KEGG database. The number of effectors which neighbour at least one TTSS component is given in the middle column; the remaining effectors are summarized in the right column.
(0.04 MB DOC)

Table S7 Performance of the classifiers using the C-terminal end. To prove the concept of the N-terminal signal peptide, C-termini should have no predictive power. The performance of several classifiers has been evaluated using exactly the same feature selection, training and test procedure as used for the N-termini. Five runs with different negative sets have been performed. (0.03 MB DOC)

Table S8 Prediction results with EffectiveT3 trained without a certain taxonomic subset. EffectiveT3 has been trained without the positive and negative samples from the excluded taxonomic groups listed in this table. Testing EffectiveT3 on these effectors (E) and randomly chosen negative samples (R) resulted in true positive (+E), false negative (−E), false positive (+R) and true negative (−R) predictions. (0.35 MB DOC)

Table S9 Pairwise comparison of orthologous effector and non-effector proteins. Truncations, elongations and conservations of the N-terminal length until the first functional domain are listed according to the effector protein (first column) compared to orthologs from non-TTSS-bearing organisms. (0.05 MB DOC)

Table S10 Effector sequences which tolerate frameshift mutations. The mutations were introduced by shifting the DNA sequences by either one or two bases to the left; stop codons were replaced by methionine. (0.04 MB DOC)

Table S11 EffectiveT3 predictions in complete proteomes. EffectiveT3 predictions for complete proteomes have been grouped by Archaea, Gram-positive and Gram-negative bacteria. Within each group, proteomes are sorted by their taxonomic lineage and species names.
For each proteome, the absence (−) or presence (+) of a TTSS, the genomic G+C content, the number of annotated proteins, the percentage of EffectiveT3 positive predictions and the genome-wide Z-Score are given. The presence of the TTSS in the proteomes as determined by KEGG and the hosts are coded by the following colors: black = without TTSS or unknown host; red = with TTSS/animal pathogenic; green = with TTSS/plant symbiotic. (0.98 MB DOC)

Table S12 Input features of the machine learning algorithms after initial feature selection. This table comprises those features that were selected from all possible feature combinations using three different alphabets (amino acid alphabet, amino acid property alphabet, hydrophobic/hydrophilic alphabet) with a maximal pattern length of three. In order to avoid over-fitting on the data, only features that occur in both the positive and the negative set, rather than being specific to one of them, are selected. (0.07 MB DOC)
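The initial feature selection summarized for Table S12 (patterns of up to three symbols over a reduced alphabet, kept only if they occur in both the positive and the negative set) might look roughly like the sketch below. It is an illustration under stated assumptions, not the published procedure: the hydrophobic residue set, the function names and the toy sequences are hypothetical, and only pattern presence is considered here, not frequencies or positions.

```python
# Assumption: an illustrative hydrophobic/hydrophilic split. The source does
# not list which residues belong to which class.
HYDROPHOBIC = set("AVLIMFWC")


def to_hp(seq):
    """Recode a sequence into a two-letter alphabet: h(ydrophobic), p(olar)."""
    return "".join("h" if aa in HYDROPHOBIC else "p" for aa in seq)


def patterns(seq, max_len=3):
    """Collect all contiguous patterns of length 1..max_len found in seq."""
    found = set()
    for k in range(1, max_len + 1):
        for i in range(len(seq) - k + 1):
            found.add(seq[i:i + k])
    return found


def shared_features(positives, negatives, n_term=25, max_len=3, encode=to_hp):
    """Keep only patterns seen in both sets, using the first n_term residues."""
    def collect(seqs):
        feats = set()
        for s in seqs:
            feats |= patterns(encode(s[:n_term]), max_len)
        return feats

    return collect(positives) & collect(negatives)


# Toy sequences invented for illustration.
features = shared_features(["AAAK"], ["AKKK"])
```

Passing a different `encode` function (for example the identity function for the plain amino acid alphabet) reuses the same selection step for the other alphabets mentioned in the caption.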

              DBatVir: the database of bat-associated viruses

              Emerging infectious diseases remain a significant threat to public health. Most emerging infectious disease agents in humans are of zoonotic origin. Bats are important reservoir hosts of many highly lethal zoonotic viruses and have been implicated in numerous emerging infectious disease events in recent years. It is essential to enhance our knowledge and understanding of the genetic diversity of the bat-associated viruses to prevent future outbreaks. To facilitate further research, we constructed the database of bat-associated viruses (DBatVir). Known viral sequences detected in bat samples were manually collected and curated, along with the related metadata, such as the sampling time, location, bat species and specimen type. Additional information concerning the bats, including common names, diet type, geographic distribution and phylogeny were integrated into the database to bridge the gap between virologists and zoologists. The database currently covers >4100 bat-associated animal viruses of 23 viral families detected from 196 bat species in 69 countries worldwide. It provides an overview and snapshot of the current research regarding bat-associated viruses, which is essential now that the field is rapidly expanding. With a user-friendly interface and integrated online bioinformatics tools, DBatVir provides a convenient and powerful platform for virologists and zoologists to analyze the virome diversity of bats, as well as for epidemiologists and public health researchers to monitor and track current and future bat-related infectious diseases. Database URL: http://www.mgc.ac.cn/DBatVir/

                Author and article information

                Journal
                Nucleic Acids Research (Nucleic Acids Res)
                Oxford University Press
                ISSN: 0305-1048, 1362-4962
                Issue date: 08 January 2019
                Published online: 05 November 2018
                Volume: 47
                Issue: Database issue
                Pages: D687-D692
                Affiliations
                MOH Key Laboratory of Systems Biology of Pathogens, Institute of Pathogen Biology, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing 100176, China
                Author notes
                To whom correspondence should be addressed. Tel: +86 10 6787 5146; Fax: +86 10 6787 5146; Email: yangj@ipbcams.ac.cn. Correspondence may also be addressed to Lihong Chen. Email: chenlh@ipbcams.ac.cn
                Author information
                http://orcid.org/0000-0002-8826-5198
                Article
                Article ID: gky1080
                DOI: 10.1093/nar/gky1080
                PMCID: PMC6324032
                PMID: 30395255
                © The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com

                History
                Accepted: 19 October 2018
                Revised: 17 October 2018
                Received: 12 September 2018
                Page count
                Pages: 6
                Funding
                Funded by: Ministry of Science and Technology of China
                Award ID: 2016YFC1202404
                Award ID: 2015CB554204
                Funded by: CAMS Innovation Fund for Medical Sciences
                Award ID: 2017-I2M-3-017
                Categories
                Database Issue

                Genetics
