      Is Open Access

      Estimating probabilistic context-free grammars for proteins using contact map constraints

      research-article


          Abstract

          Interactions between amino acids that are close in the spatial structure, but not necessarily in the sequence, play important structural and functional roles in proteins. These non-local interactions ought to be taken into account when modeling collections of proteins. Yet the most popular representations of sets of related protein sequences remain the profile Hidden Markov Models. Because these models treat the distributions of the conserved columns of an underlying multiple sequence alignment independently, they cannot capture dependencies between protein residues. Non-local interactions can be represented with more expressive grammatical models, but learning such grammars is difficult. In this work, we propose to use information on protein contacts to facilitate the training of probabilistic context-free grammars representing families of protein sequences. We develop the theory behind the introduction of contact constraints in maximum-likelihood and contrastive estimation schemes and implement it in a machine learning framework for protein grammars. The proposed framework is tested on samples of protein motifs and compared with learning without contact constraints. The evaluation shows high fidelity of grammatical descriptors to protein structures and improved precision in recognizing sequences. Finally, we present an example of using our method in a practical setting and demonstrate its potential beyond the current state of the art by creating a grammatical model of a meta-family of protein motifs. We conclude that this work is a significant step towards more flexible and accurate modeling of collections of protein sequences. The software package is made available to the community.
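          The idea of restricting a probabilistic context-free grammar with contact information can be illustrated with a minimal sketch. The code below is not the authors' software: the function name `inside_probability`, the toy grammar, the rule probabilities, and the rule shapes (A -> a, A -> B C, and a "pairing" rule A -> a B b) are all assumptions made for illustration. It also assumes one simple reading of a contact constraint, namely that a pairing rule may only span two positions listed in the contact map, which is not necessarily the definition used in the article.

```python
from collections import defaultdict

# Hypothetical toy grammar over the residues "A" and "C" (illustrative only).
LEXICAL = {("S", "A"): 0.25, ("S", "C"): 0.25}   # S -> A | C
BRANCHING = {("S", "S", "S"): 0.2}               # S -> S S
PAIRING = {("S", "C", "S", "C"): 0.3}            # S -> C S C (ends "in contact")


def inside_probability(seq, lexical, branching, pairing, start="S", contacts=None):
    """Inside (CYK-style) probability of `seq` under the toy grammar.

    contacts: optional set of 0-based (i, j) pairs with i < j. When given,
    a pairing rule is applied to a span only if its two end positions are
    listed in `contacts` (the assumed form of the contact constraint).
    With contacts=None, no constraint is applied.
    """
    n = len(seq)
    # inside[i, j, A] = probability that nonterminal A derives seq[i..j]
    inside = defaultdict(float)

    # Spans of length 1: lexical rules A -> a
    for i, residue in enumerate(seq):
        for (lhs, symbol), p in lexical.items():
            if symbol == residue:
                inside[i, i, lhs] += p

    # Longer spans, shortest first, so sub-spans are already filled in
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            # Branching rules A -> B C, summed over every split point
            for (lhs, left, right), p in branching.items():
                for k in range(i, j):
                    inside[i, j, lhs] += p * inside[i, k, left] * inside[k + 1, j, right]
            # Pairing rules A -> a B b, gated by the contact map if present
            if length >= 3 and (contacts is None or (i, j) in contacts):
                for (lhs, first, inner, last), p in pairing.items():
                    if seq[i] == first and seq[j] == last:
                        inside[i, j, lhs] += p * inside[i + 1, j - 1, inner]

    return inside[0, n - 1, start]


unconstrained = inside_probability("CAC", LEXICAL, BRANCHING, PAIRING)
no_contacts = inside_probability("CAC", LEXICAL, BRANCHING, PAIRING, contacts=set())
print(unconstrained, no_contacts)
```

          In this toy setting, an empty contact map removes every derivation that uses the pairing rule, so the probability of "CAC" drops to the mass of the purely branching derivations. Supplying the contact (0, 2) restores the unconstrained value.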


                Author and article information

                Contributors
                Journal
                PeerJ
                PeerJ Inc. (San Diego, USA)
                ISSN: 2167-8359
                18 March 2019
                2019; 7: e6559
                Affiliations
                [1] Wydział Podstawowych Problemów Techniki, Katedra Inżynierii Biomedycznej, Politechnika Wrocławska, Wrocław, Poland
                [2] Univ Rennes, Inria, CNRS, IRISA, Rennes, France
                Article
                6559
                10.7717/peerj.6559
                6428041
                © 2019 Dyrka et al.

                This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.

                History
                Received: 26 July 2018
                Accepted: 3 February 2019
                Funding
                Funded by: National Science Centre, Poland
                Award ID: 2015/17/D/ST6/04054
                Funded by: E-SCIENCE.PL Infrastructure
                Funded by: University of Rennes
                Funded by: Wroclaw Center for Networking and Supercomputing
                Award ID: 98
                This research was funded by the National Science Centre, Poland (grant no. 2015/17/D/ST6/04054) and supported by the E-SCIENCE.PL Infrastructure. Hugo Talibart is funded by a PhD grant from the University of Rennes. Computational experiments were partially carried out using resources provided by the Wroclaw Centre for Networking and Supercomputing (http://wcss.pl) (grant no. 98). No additional external funding was received for this study. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
                Categories
                Bioinformatics
                Mathematical Biology
                Computational Science
                Data Mining and Machine Learning

                Keywords
                structural constraints, syntactic tree, maximum-likelihood estimator, probabilistic context-free grammar, contrastive estimation, protein contact map, protein sequence
