      Is Open Access

      Detecting sequence signals in targeting peptides using deep learning

      research-article


          Abstract

          During the development of TargetP 2.0, a state-of-the-art method for predicting targeting signals, we found a previously overlooked biological signal for subcellular targeting by examining the output of a deep learning method.

          Abstract

          In bioinformatics, machine learning methods have been used to predict features embedded in the sequences. In contrast to what is generally assumed, machine learning approaches can also provide new insights into the underlying biology. Here, we demonstrate this by presenting TargetP 2.0, a novel state-of-the-art method to identify N-terminal sorting signals, which direct proteins to the secretory pathway, mitochondria, and chloroplasts or other plastids. By examining the strongest signals from the attention layer in the network, we find that the second residue in the protein, that is, the one following the initial methionine, has a strong influence on the classification. We observe that two-thirds of chloroplast and thylakoid transit peptides have an alanine in position 2, compared with 20% in other plant proteins. We also note that in fungi and single-celled eukaryotes, less than 30% of the targeting peptides have an amino acid that allows the removal of the N-terminal methionine compared with 60% for the proteins without targeting peptide. The importance of this feature for predictions has not been highlighted before.
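The position-2 statistics quoted in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names, the toy sequences, and the set of residues treated as permitting removal of the N-terminal methionine (small residues A, C, G, P, S, T, V) are assumptions made for illustration only.

```python
from collections import Counter

# Residues at position 2 that are commonly reported to permit removal of the
# initiator methionine (small side chains). This set is an assumption of the
# sketch, not a definition taken from the article.
MET_REMOVAL_RESIDUES = frozenset("ACGPSTV")


def position2_residues(sequences):
    """Return the residue following the initial methionine for each sequence.

    Sequences shorter than two residues or not starting with 'M' are skipped.
    """
    cleaned = (seq.strip().upper() for seq in sequences)
    return [s[1] for s in cleaned if len(s) >= 2 and s[0] == "M"]


def fraction_with(residues, allowed):
    """Fraction of residues that belong to the `allowed` set (0.0 if empty)."""
    if not residues:
        return 0.0
    return sum(r in allowed for r in residues) / len(residues)


def position2_summary(sequences):
    """Summarise position-2 composition for one group of proteins."""
    residues = position2_residues(sequences)
    return {
        "n": len(residues),
        "counts": Counter(residues),
        "frac_alanine": fraction_with(residues, {"A"}),
        "frac_met_removable": fraction_with(residues, MET_REMOVAL_RESIDUES),
    }


if __name__ == "__main__":
    # Made-up toy sequences, for illustration only (not data from the study).
    toy = ["MASTLLS", "MAAQVRT", "MKLWTRS", "MSDEKLA"]
    summary = position2_summary(toy)
    print(summary["n"], summary["frac_alanine"], summary["frac_met_removable"])
```

Running `position2_summary` separately on a group of proteins with targeting peptides and a group without would give the kind of side-by-side fractions the abstract reports.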


          Most cited references (26)


          Predotar: A tool for rapidly screening proteomes for N-terminal targeting sequences.

          Probably more than 25% of the proteins encoded by the nuclear genomes of multicellular eukaryotes are targeted to membrane-bound compartments by N-terminal targeting signals. The major signals are those for the endoplasmic reticulum, the mitochondria, and in plants, plastids. The most abundant of these targeted proteins are well-known and well-studied, but a large proportion remain unknown, including most of those involved in regulation of organellar gene expression or regulation of biochemical pathways. The discovery and characterization of these proteins by biochemical means will be long and difficult. An alternative method is to identify candidate organellar proteins via their characteristic N-terminal targeting sequences. We have developed a neural network-based approach (Predotar--Prediction of Organelle Targeting sequences) for identifying genes encoding these proteins amongst eukaryotic genome sequences. The power of this approach for identifying and annotating novel gene families has been illustrated by the discovery of the pentatricopeptide repeat family.

            Seq2Logo: a method for construction and visualization of amino acid binding motifs and sequence profiles including sequence weighting, pseudo counts and two-sided representation of amino acid enrichment and depletion

            Seq2Logo is a web-based sequence logo generator. Sequence logos are a graphical representation of the information content stored in a multiple sequence alignment (MSA) and provide a compact and highly intuitive representation of the position-specific amino acid composition of binding motifs, active sites, etc. in biological sequences. Accurate generation of sequence logos is often compromised by sequence redundancy and low number of observations. Moreover, most methods available for sequence logo generation focus on displaying the position-specific enrichment of amino acids, discarding the equally valuable information related to amino acid depletion. Seq2Logo aims at resolving these issues allowing the user to include sequence weighting to correct for data redundancy, pseudo counts to correct for low number of observations and different logotype representations each capturing different aspects related to amino acid enrichment and depletion. Besides allowing input in the format of peptides and MSA, Seq2Logo accepts input as Blast sequence profiles, providing easy access for non-expert end-users to characterize and identify functionally conserved/variable amino acids in any given protein of interest. The output from the server is a sequence logo and a PSSM. Seq2Logo is available at http://www.cbs.dtu.dk/biotools/Seq2Logo (14 May 2012, date last accessed).

              Sorting Signals, N-Terminal Modifications and Abundance of the Chloroplast Proteome

              Characterization of the chloroplast proteome is needed to understand the essential contribution of the chloroplast to plant growth and development. Here we present a large scale analysis by nanoLC-Q-TOF and nanoLC-LTQ-Orbitrap mass spectrometry (MS) of ten independent chloroplast preparations from Arabidopsis thaliana which unambiguously identified 1325 proteins. Novel proteins include various kinases and putative nucleotide binding proteins. Based on repeated and independent MS based protein identifications requiring multiple matched peptide sequences, as well as literature, 916 nuclear-encoded proteins were assigned with high confidence to the plastid, of which 86% had a predicted chloroplast transit peptide (cTP). The protein abundance of soluble stromal proteins was calculated from normalized spectral counts from LTQ-Orbitrap analysis and was found to cover four orders of magnitude. Comparison to gel-based quantification demonstrates that ‘spectral counting’ can provide large scale protein quantification for Arabidopsis. This quantitative information was used to determine possible biases for protein targeting prediction by TargetP and also to understand the significance of protein contaminants. The abundance data for 550 stromal proteins was used to understand abundance of metabolic pathways and chloroplast processes. We highlight the abundance of 48 stromal proteins involved in post-translational proteome homeostasis (including aminopeptidases, proteases, deformylases, chaperones, protein sorting components) and discuss the biological implications. N-terminal modifications were identified for a subset of nuclear- and chloroplast-encoded proteins and a novel N-terminal acetylation motif was discovered. Analysis of cTPs and their cleavage sites of Arabidopsis chloroplast proteins, as well as their predicted rice homologues, identified new species-dependent features, which will facilitate improved subcellular localization prediction. No evidence was found for suggested targeting via the secretory system. This study provides the most comprehensive chloroplast proteome analysis to date and an expanded Plant Proteome Database (PPDB) in which all MS data are projected on identified gene models.

                Author and article information

                Journal
                Life Science Alliance (Life Sci Alliance)
                Life Science Alliance LLC
                2575-1077
                30 September 2019
                October 2019
                Volume: 2
                Issue: 5
                Article: e201900429
                Affiliations
                [1] Department of Health Technology, Section for Bioinformatics, Technical University of Denmark, Kongens Lyngby, Denmark
                [2] Science for Life Laboratory, Solna, Sweden
                [3] Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden
                [4] Department of Gene Technology, School of Engineering Sciences in Biotechnology, Chemistry and Health, KTH Royal Institute of Technology, Stockholm, Sweden
                [5] DTU Compute, Technical University of Denmark, Kongens Lyngby, Denmark
                [6] Computational and RNA Biology, University of Copenhagen, Copenhagen, Denmark
                [7] Centre for Genomic Medicine, Rigshospitalet, Copenhagen University Hospital, Copenhagen, Denmark
                Author notes
                Correspondence: arne@bioinfo.se
                [*] Jose Juan Almagro Armenteros and Marco Salvatore contributed equally to this work

                Author information
                https://orcid.org/0000-0001-5775-0417
                https://orcid.org/0000-0002-8879-9245
                https://orcid.org/0000-0002-4490-8569
                https://orcid.org/0000-0002-7115-9751
                https://orcid.org/0000-0002-9412-9643
                Article
                Publisher ID: LSA-2019-00429
                DOI: 10.26508/lsa.201900429
                PMCID: PMC6769257
                PMID: 31570514
                d59b0933-9f5e-48ed-ab86-7e4bb6b0777e
                © 2019 Armenteros et al.

                This article is available under a Creative Commons License (Attribution 4.0 International, as described at https://creativecommons.org/licenses/by/4.0/).

                History
                Received: 15 May 2019
                Revised: 18 September 2019
                Accepted: 18 September 2019
                Funding
                Funded by: Swedish National Research Council
                Award ID: VR-NT-2016-03798
                Categories
                Method
                Methods
                26
