      A Random Forest Sub-Golgi Protein Classifier Optimized via Dipeptide and Amino Acid Composition Features

      research-article


          Abstract

          To gain insight into the malfunction of the Golgi apparatus and its relationship to various genetic and neurodegenerative diseases, the identification of sub-Golgi proteins, both cis-Golgi and trans-Golgi proteins, is of great significance. In this study, a state-of-the-art random forests sub-Golgi protein classifier, rfGPT, was developed. The rfGPT used 2-gap dipeptide and split amino acid composition for the feature vectors and was combined with the synthetic minority over-sampling technique (SMOTE) and an analysis of variance (ANOVA) feature selection method. The rfGPT was trained on a sub-Golgi protein sequence data set (137 sequences) with sequence identity below 25%. For the optimal rfGPT classifier with 93 features, the accuracy (ACC) was 90.5%; the Matthews correlation coefficient (MCC) was 0.811; the sensitivity (Sn) was 92.6%; and the specificity (Sp) was 88.4%. The independent testing scores for the rfGPT were ACC = 90.6%; MCC = 0.696; Sn = 96.1%; and Sp = 69.2%. Although the independent testing accuracy was 4.4% lower than that of the best reported sub-Golgi classifier, which was trained on a data set with 40% sequence identity (304 sequences), the rfGPT is currently the top sub-Golgi protein predictor that uses feature vectors without any position-specific scoring matrix or its derivative features. Therefore, the rfGPT is a more practical tool, because it requires no sequence alignment against tens of millions of protein sequences. To date, the rfGPT is the Golgi classifier with the best independent testing scores, optimized by training on smaller benchmark data sets. Feature importance analysis indicates that the composition of non-polar, aliphatic residues, the (aromatic residues) + (non-polar, aliphatic residues) dipeptide, and the composition of aromatic residues between the NH2-terminal and COOH-terminal of protein sequences are the top three biological features for distinguishing the sub-Golgi proteins.
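
          The 2-gap dipeptide features and the four reported scores can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the normalization by the number of sequence windows, and the handling of non-standard residues are assumptions. The confusion counts in the usage note are also hypothetical: the abstract does not state the composition of the independent test set, and the counts were chosen only because they reproduce the rounded scores quoted above.

```python
from itertools import product
from math import sqrt

# The 20 standard amino acids and all 400 ordered residue pairs.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def gap_dipeptide_composition(sequence, gap=2):
    """Return 400 frequencies of residue pairs separated by `gap` residues.

    Assumption: counts are divided by the number of windows (L - gap - 1);
    pairs containing a non-standard residue are skipped but still count
    as a window.
    """
    sequence = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    n_windows = len(sequence) - gap - 1
    if n_windows <= 0:
        # Sequence too short to contain a single gapped pair.
        return [0.0] * len(DIPEPTIDES)
    for i in range(n_windows):
        pair = sequence[i] + sequence[i + gap + 1]
        if pair in counts:
            counts[pair] += 1
    return [counts[d] / n_windows for d in DIPEPTIDES]


def binary_metrics(tp, fn, tn, fp):
    """Compute ACC, Sn, Sp and MCC from binary confusion counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "ACC": (tp + tn) / (tp + fn + tn + fp),
        "Sn": tp / (tp + fn),   # sensitivity: recall on the positive class
        "Sp": tn / (tn + fp),   # specificity: recall on the negative class
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Hypothetical counts, consistent with the reported independent-test scores.
scores = binary_metrics(tp=49, fn=2, tn=9, fp=4)
features = gap_dipeptide_composition("ACDEFG", gap=2)
```

          With the hypothetical counts TP = 49, FN = 2, TN = 9 and FP = 4, the sketch gives ACC ≈ 0.906, Sn ≈ 0.961, Sp ≈ 0.692 and MCC ≈ 0.696. This also shows why the MCC is well below the accuracy on an imbalanced test set: the few negative examples are classified much less reliably than the positives.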


                Author and article information

                Contributors
                Journal
                Frontiers in Bioengineering and Biotechnology (Front. Bioeng. Biotechnol.)
                Publisher: Frontiers Media S.A.
                ISSN: 2296-4185
                Published: 04 September 2019
                Volume: 7
                Article: 215
                Affiliations
                [1] 1Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China , Chengdu, China
                [2] 2Department of Neurology, Heilongjiang Province Land Reclamation Headquarters General Hospital , Harbin, China
                [3] 3Center for Informational Biology, University of Electronic Science and Technology of China , Chengdu, China
                Author notes

                Edited by: Yudong Cai, Shanghai University, China

                Reviewed by: Tao Zeng, Shanghai Institutes for Biological Sciences (CAS), China; Zhiwen Yu, South China University of Technology, China

                *Correspondence: Quan Zou zouquan@ 123456nclab.net

                This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Bioengineering and Biotechnology

                †These authors have contributed equally to this work

                Article
                DOI: 10.3389/fbioe.2019.00215
                PMCID: 6737778
                PMID: 31552241
                Copyright © 2019 Lv, Jin, Ding and Zou.

                This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

                History
                Received: 23 July 2019
                Accepted: 22 August 2019
                Page count
                Figures: 3, Tables: 3, Equations: 0, References: 85, Pages: 11, Words: 8545
                Categories
                Bioengineering and Biotechnology
                Original Research

                Keywords: random forests, sub-Golgi protein classifier, ANOVA feature selection, split amino acid composition, k-gap dipeptide, synthetic minority over-sampling
