      Is Open Access

      Sequence Based Prediction of DNA-Binding Proteins Based on Hybrid Feature Selection Using Random Forest and Gaussian Naïve Bayes

      research-article


          Abstract

          Developing an efficient method for identifying DNA-binding proteins is highly desirable: because of their vital roles in gene regulation, such a method would be invaluable for advancing our understanding of protein function. In this study, we propose a new method for predicting DNA-binding proteins that ranks features using random forest and then performs wrapper-based feature selection with a forward best-first search strategy. The features comprise information from the primary sequence, predicted secondary structure, predicted relative solvent accessibility, and the position-specific scoring matrix. The proposed method, called DBPPred, uses Gaussian naïve Bayes as the underlying classifier because it outperformed five other classifiers: decision tree, logistic regression, k-nearest neighbor, support vector machine with polynomial kernel, and support vector machine with radial basis function kernel. DBPPred yields the highest average accuracy of 0.791 and average MCC of 0.583 under five-fold cross validation with ten runs on the training benchmark dataset PDB594. Blind tests on the independent dataset PDB186 were then performed with the proposed model trained on the entire PDB594 dataset and with five existing methods (iDNA-Prot, DNA-Prot, DNAbinder, DNABIND and DBD-Threader); DBPPred yielded the highest accuracy of 0.769, MCC of 0.538, and AUC of 0.790. Independent tests of DBPPred on a large non-DNA-binding protein dataset and two RNA-binding protein datasets also showed improved or comparable quality compared with the relevant prediction methods. Moreover, we observed that, for the majority of the features selected by the proposed method, the mean feature values differ statistically significantly between the DNA-binding and the non-DNA-binding proteins. All of the experimental results indicate that the proposed DBPPred can be an alternative predictor for large-scale determination of DNA-binding proteins.
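
          The pipeline outlined in the abstract (rank features, select a subset with a wrapper around the classifier, score by cross-validated MCC) can be illustrated with a minimal sketch. This is not the authors' implementation and makes several assumptions: the data are synthetic toy values, the ranking uses a simple standardized difference of class means as a stand-in for random forest importance, and a plain greedy forward pass stands in for best-first search. All function names (`mcc`, `rank_features`, `cv_mcc`, `forward_select`) are hypothetical.

```python
import math
import random
from statistics import mean, pvariance


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = positive)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


class GaussianNB:
    """Minimal Gaussian naive Bayes over lists of numeric feature vectors."""

    def fit(self, X, y):
        self.params = {}
        for c in set(y):
            rows = [x for x, label in zip(X, y) if label == c]
            # Per-feature mean and variance; a small constant avoids zero variance.
            stats = [(mean(col), pvariance(col) + 1e-9) for col in zip(*rows)]
            self.params[c] = (math.log(len(rows) / len(X)), stats)
        return self

    def _log_post(self, c, x):
        log_prior, stats = self.params[c]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
            for v, (mu, var) in zip(x, stats)
        )

    def predict(self, X):
        return [max(self.params, key=lambda c: self._log_post(c, x)) for x in X]


def rank_features(X, y):
    """Rank feature indices by |difference of class means| / pooled std.
    Assumption: a stand-in for the random forest ranking named in the abstract."""
    scores = []
    for j, col in enumerate(zip(*X)):
        pos = [v for v, label in zip(col, y) if label == 1]
        neg = [v for v, label in zip(col, y) if label == 0]
        pooled = math.sqrt((pvariance(pos) + pvariance(neg)) / 2) + 1e-9
        scores.append((abs(mean(pos) - mean(neg)) / pooled, j))
    return [j for _, j in sorted(scores, reverse=True)]


def cv_mcc(X, y, features, k=5, runs=10):
    """Mean MCC over `runs` repetitions of k-fold cross validation."""
    def take(i):
        return [X[i][f] for f in features]

    scores = []
    for seed in range(runs):
        idx = list(range(len(X)))
        random.Random(seed).shuffle(idx)
        for fold in (idx[i::k] for i in range(k)):
            held = set(fold)
            train = [i for i in idx if i not in held]
            model = GaussianNB().fit([take(i) for i in train], [y[i] for i in train])
            preds = model.predict([take(i) for i in fold])
            scores.append(mcc([y[i] for i in fold], preds))
    return mean(scores)


def forward_select(X, y, ranked):
    """Greedy forward pass: keep a feature only if the cross-validated MCC improves."""
    selected, best = [], -1.0
    for f in ranked:
        score = cv_mcc(X, y, selected + [f])
        if score > best:
            selected, best = selected + [f], score
    return selected, best


# Synthetic toy data: feature 0 is informative, features 1 and 2 are noise.
rng = random.Random(42)
y = [i % 2 for i in range(200)]
X = [[rng.gauss(2.0 * label, 1.0), rng.gauss(0, 1), rng.gauss(0, 1)] for label in y]
ranked = rank_features(X, y)
selected, score = forward_select(X, y, ranked)
```

          The wrapper step is the expensive part, since every candidate subset triggers a full repeated cross validation; ranking first keeps the number of candidates linear in the number of features.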


                Author and article information

                Contributors
                Role: Editor
                Journal
                PLoS ONE
                Public Library of Science (San Francisco, USA)
                ISSN: 1932-6203
                Published: 24 January 2014
                Volume: 9
                Issue: 1
                Article: e86703
                Affiliations
                [1] School of Computer and Information Engineering, Zhejiang Gongshang University, Hangzhou, PR China
                University of South Florida College of Medicine, United States of America
                Author notes

                Competing Interests: The authors have declared that no competing interests exist.

                Conceived and designed the experiments: WL XW BJ HZ. Performed the experiments: WL. Analyzed the data: WL XW FC. Contributed reagents/materials/analysis tools: YC HZ. Wrote the paper: XW HZ.

                Article
                Manuscript ID: PONE-D-13-40102
                DOI: 10.1371/journal.pone.0086703
                PMCID: PMC3901691
                PMID: 24475169
                dfd4b26f-7219-4653-a1e2-897925b91097
                Copyright © 2014

                This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                History
                Received: 29 September 2013
                Accepted: 10 December 2013
                Page count
                Pages: 10
                Funding
                This work was supported by the National Natural Science Foundation of China (grant no. 61170099) and the Zhejiang Provincial Natural Science Foundation of China (grant no. Y1110840). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
                Categories
                Research Article
                Biology
                Biochemistry
                Proteins
                DNA-binding proteins
                Computational Biology
                Genomics
                Structure Prediction
                Macromolecular Structure Analysis
                Protein Structure
                Computer Science
                Algorithms
                Engineering
                Signal Processing
                Data Mining
                Mathematics
                Applied Mathematics
                Algorithms
