10
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      UDSMProt: universal deep sequence models for protein classification

      research-article
      , , ,
      Bioinformatics
      Oxford University Press

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          Motivation

          Inferring the properties of a protein from its amino acid sequence is one of the key problems in bioinformatics. Most state-of-the-art approaches for protein classification are tailored to single classification tasks and rely on handcrafted features, such as position-specific-scoring matrices from expensive database searches. We argue that this level of performance can be reached or even be surpassed by learning a task-agnostic representation once, using self-supervised language modeling, and transferring it to specific tasks by a simple fine-tuning step.

          Results

          We put forward a universal deep sequence model that is pre-trained on unlabeled protein sequences from Swiss-Prot and fine-tuned on protein classification tasks. We apply it to three prototypical tasks, namely enzyme class prediction, gene ontology prediction and remote homology and fold detection. The proposed method performs on par with state-of-the-art algorithms that were tailored to these specific tasks or, for two out of three tasks, even outperforms them. These results stress the possibility of inferring protein properties from the sequence alone and, on more general grounds, the prospects of modern natural language processing methods in omics. Moreover, we illustrate the prospects for explainable machine learning methods in this field by selected case studies.

          Availability and implementation

          Source code is available under https://github.com/nstrodt/UDSMProt.

          Supplementary information

          Supplementary data are available at Bioinformatics online.

          Related collections

          Most cited references21

          • Record: found
          • Abstract: found
          • Article: found
          Is Open Access

          ProtVec: A Continuous Distributed Representation of Biological Sequences

          We introduce a new representation and feature extraction method for biological sequences. Named bio-vectors (BioVec) to refer to biological sequences in general with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences, this representation can be widely used in applications of deep learning in proteomics and genomics. In the present paper, we focus on protein-vectors that can be utilized in a wide array of bioinformatics investigations such as family classification, protein visualization, structure prediction, disordered protein identification, and protein-protein interaction prediction. In this method, we adopt artificial neural network approaches and represent a protein sequence with a single dense n-dimensional vector. To evaluate this method, we apply it in classification of 324,018 protein sequences obtained from Swiss-Prot belonging to 7,027 protein families, where an average family classification accuracy of 93%+-0.06% is obtained, outperforming existing family classification methods. In addition, we use ProtVec representation to predict disordered proteins from structured proteins. Two databases of disordered sequences are used: the DisProt database as well as a database featuring the disordered regions of nucleoporins rich with phenylalanine-glycine repeats (FG-Nups). Using support vector machine classifiers, FG-Nup sequences are distinguished from structured protein sequences found in Protein Data Bank (PDB) with a 99.8% accuracy, and unstructured DisProt sequences are differentiated from structured DisProt sequences with 100.0% accuracy. These results indicate that by only providing sequence data for various proteins into this model, accurate information about protein structure can be determined.
            Bookmark
            • Record: found
            • Abstract: found
            • Article: found
            Is Open Access

            DeepGO: predicting protein functions from sequence and interactions using a deep ontology-aware classifier

            Abstract Motivation A large number of protein sequences are becoming available through the application of novel high-throughput sequencing technologies. Experimental functional characterization of these proteins is time-consuming and expensive, and is often only done rigorously for few selected model organisms. Computational function prediction approaches have been suggested to fill this gap. The functions of proteins are classified using the Gene Ontology (GO), which contains over 40 000 classes. Additionally, proteins have multiple functions, making function prediction a large-scale, multi-class, multi-label problem. Results We have developed a novel method to predict protein function from sequence. We use deep learning to learn features from protein sequences as well as a cross-species protein–protein interaction network. Our approach specifically outputs information in the structure of the GO and utilizes the dependencies between GO classes as background information to construct a deep learning model. We evaluate our method using the standards established by the Computational Assessment of Function Annotation (CAFA) and demonstrate a significant improvement over baseline methods such as BLAST, in particular for predicting cellular locations. Availability and implementation Web server: http://deepgo.bio2vec.net, Source code: https://github.com/bio-ontology-research-group/deepgo Contact robert.hoehndorf@kaust.edu.sa Supplementary information Supplementary data are available at Bioinformatics online.
              Bookmark
              • Record: found
              • Abstract: found
              • Article: not found

              The classification of amino acid conservation.

              A classification of amino acid type is described which is based on a synthesis of physico-chemical and mutation data. This is organised in the form of a Venn diagram from which sub-sets are derived that include groups of amino acids likely to be conserved for similar structural reasons. These sets are used to describe conservation in aligned sequences by allocating to each position the smallest set that contains all the residue types brought together by alignment. This minimal set assignment provides a simple way of reducing the information contained in a sequence alignment to a form which can be analysed by computer yet remains readable.
                Bookmark

                Author and article information

                Contributors
                Role: Associate Editor
                Journal
                Bioinformatics
                Bioinformatics
                bioinformatics
                Bioinformatics
                Oxford University Press
                1367-4803
                1367-4811
                15 April 2020
                08 January 2020
                08 January 2020
                : 36
                : 8
                : 2401-2409
                Affiliations
                Department of Video Coding & Analytics, Fraunhofer Heinrich Hertz Institute , Berlin 10587, Germany
                Author notes
                Author information
                http://orcid.org/0000-0003-4447-0162
                Article
                btaa003
                10.1093/bioinformatics/btaa003
                7178389
                31913448
                cf91a4c5-842b-401b-9b97-e214d614ac74
                © The Author(s) 2020. Published by Oxford University Press.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

                History
                : 02 September 2019
                : 13 December 2019
                : 02 January 2020
                Page count
                Pages: 9
                Funding
                Funded by: Berlin Big Data Center;
                Award ID: 01IS14013A
                Funded by: Berlin Center for Machine Learning;
                Award ID: 01IS18037I
                Categories
                Original Papers
                Sequence Analysis

                Bioinformatics & Computational biology
                Bioinformatics & Computational biology

                Comments

                Comment on this article