      Is Open Access

      Prediction of protein solvent accessibility using PSO-SVR with multiple sequence-derived features and weighted sliding window scheme

      research-article


          Abstract

          Background

          The prediction of solvent accessibility can provide valuable clues for analyzing protein structure and function, for example in three-dimensional structure prediction and B-cell epitope prediction. To fully decipher the protein-protein interaction process, an initial but crucial step is to calculate the protein solvent accessibility, especially when the tertiary structure of the protein is unknown. Although some effort has been put into protein solvent accessibility prediction, the performance of existing methods is far from satisfactory.

          Methods

          To develop a high-accuracy model, we focus on several aspects that affect prediction performance: the sequence-derived features, a weighted sliding window scheme, and the parameter optimization of the machine learning approach. We address these as follows. First, we explore various features that have been observed to be associated with residue solvent accessibility. These discriminative features include protein evolutionary information, predicted protein secondary structure, native disorder, physicochemical propensities and several sequence-based structural descriptors of residues. Second, because adjacent residues in a sliding window contribute differently, we propose a weighted sliding window scheme to differentiate the contributions of adjacent residues to the central residue. Third, particle swarm optimization (PSO) is employed to search for the globally best parameters of the proposed predictor.
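The weighted window and the PSO search described above can be illustrated with a minimal sketch. The abstract does not give the weighting formula, the PSO settings, or the parameters being tuned, so everything here is an assumption: the exponential decay in `window_weights`, the zero padding at sequence ends, the inertia and acceleration constants, and the toy objective, which merely stands in for a cross-validated prediction error. It is not the authors' implementation.

```python
import math
import random


def window_weights(half_width, decay=0.5):
    """Hypothetical weights: 1.0 at the central residue, decaying with distance."""
    return [math.exp(-decay * abs(k)) for k in range(-half_width, half_width + 1)]


def weighted_window(features, center, half_width, decay=0.5):
    """Concatenate per-residue feature vectors in a window, each scaled by its weight.

    Window positions that fall outside the sequence are zero-padded (an assumption).
    """
    dim = len(features[0])
    weights = window_weights(half_width, decay)
    positions = range(center - half_width, center + half_width + 1)
    out = []
    for w, pos in zip(weights, positions):
        if 0 <= pos < len(features):
            out.extend(w * x for x in features[pos])
        else:
            out.extend([0.0] * dim)
    return out


def pso(objective, bounds, n_particles=20, n_iter=100,
        inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Standard global-best PSO that minimises `objective` inside box `bounds`."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]              # personal best positions
    best_val = [objective(p) for p in pos]      # personal best values
    g = min(range(n_particles), key=best_val.__getitem__)
    g_pos, g_val = best_pos[g][:], best_val[g]  # global best so far
    for _ in range(n_iter):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (g_pos[d] - pos[i][d]))
                # Move the particle and clamp it to the search box.
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < best_val[i]:
                best_pos[i], best_val[i] = pos[i][:], val
                if val < g_val:
                    g_pos, g_val = pos[i][:], val
    return g_pos, g_val


def toy_error(params):
    """Stand-in for a cross-validated error surface; minimum at (3, -5)."""
    x, y = params
    return (x - 3.0) ** 2 + (y + 5.0) ** 2


if __name__ == "__main__":
    feats = [[1.0], [2.0], [3.0]]               # one feature per residue
    print(weighted_window(feats, center=0, half_width=1))
    print(pso(toy_error, bounds=[(-5.0, 15.0), (-15.0, 3.0)]))
```

In a real pipeline the objective passed to `pso` would train a regressor on the weighted-window features and return its validation error; the quadratic used here only keeps the example fast and self-contained.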

          Results

          Evaluated by 3-fold cross-validation, our method achieves a mean absolute error (MAE) of 14.1% and a Pearson correlation coefficient (PCC) of 0.75 on our newly compiled dataset. When compared with state-of-the-art prediction models on two benchmark datasets, our method demonstrates better performance. These experimental results show that our predictor, PSAP, achieves high performance and outperforms many existing predictors. A web server called PSAP has been built and is freely available at http://59.73.198.144:8088/SolventAccessibility/.
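For readers unfamiliar with the two reported metrics, the sketch below computes MAE and PCC using their standard definitions. It assumes that predictions and targets are relative solvent accessibility values expressed in percent; the function names and the example numbers are illustrative and are not taken from the article.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_correlation(y_true, y_pred):
    """Pearson correlation coefficient between observed and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


if __name__ == "__main__":
    observed = [10.0, 50.0, 90.0]    # made-up accessibility values, in percent
    predicted = [20.0, 40.0, 100.0]
    print(mean_absolute_error(observed, predicted))
    print(pearson_correlation(observed, predicted))
```

Under this reading, an MAE of 14.1% means the predicted accessibility differs from the observed value by about 14 percentage points on average, while the PCC measures how well the predictions track the observed values linearly.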

          Electronic supplementary material

          The online version of this article (doi:10.1186/s13040-014-0031-3) contains supplementary material, which is available to authorized users.


                Author and article information

                Contributors
                zhangj943@nenu.edu.cn
                whchen_nenu@yeah.net
                ppsun_nenu@yeah.net
                xwzhao_nenu@yeah.net
                zhiqiangma.nenu@gmail.com
                Journal
                BioData Mining (BioData Min)
                BioMed Central (London)
                1756-0381
                31 January 2015
                2015; 8: 3
                Affiliations
                School of Computer Science and Information Technology, Northeast Normal University, Changchun, 130117 P.R. China
                School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, Queensland, Australia
                The Engineering Laboratory for Drug-Gene and Protein Screening, Northeast Normal University, Changchun, 130117 P.R. China
                Article
                31
                10.1186/s13040-014-0031-3
                4608127
                26478747
                0e15d0aa-f9cd-4a83-90e5-0a392c033594
                © Zhang et al. 2015

                Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

                History
                Received: 18 March 2014
                Accepted: 4 December 2014
                Categories
                Research

                Bioinformatics & Computational biology
                solvent accessibility, support vector regression, protein sequence, particle swarm optimization
