31
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      Fast and Accurate Multivariate Gaussian Modeling of Protein Families: Predicting Residue Contacts and Protein-Interaction Partners

      research-article

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          In the course of evolution, proteins show a remarkable conservation of their three-dimensional structure and their biological function, leading to strong evolutionary constraints on the sequence variability between homologous proteins. Our method aims at extracting such constraints from rapidly accumulating sequence data, and thereby at inferring protein structure and function from sequence information alone. Recently, global statistical inference methods (e.g. direct-coupling analysis, sparse inverse covariance estimation) have achieved a breakthrough towards this aim, and their predictions have been successfully implemented into tertiary and quaternary protein structure prediction methods. However, due to the discrete nature of the underlying variable (amino-acids), exact inference requires exponential time in the protein length, and efficient approximations are needed for practical applicability. Here we propose a very efficient multivariate Gaussian modeling approach as a variant of direct-coupling analysis: the discrete amino-acid variables are replaced by continuous Gaussian random variables. The resulting statistical inference problem is efficiently and exactly solvable. We show that the quality of inference is comparable or superior to the one achieved by mean-field approximations to inference with discrete variables, as done by direct-coupling analysis. This is true for ( i) the prediction of residue-residue contacts in proteins, and ( ii) the identification of protein-protein interaction partner in bacterial signal transduction. An implementation of our multivariate Gaussian approach is available at the website http://areeweb.polito.it/ricerca/cmp/code.

          Related collections

          Most cited references23

          • Record: found
          • Abstract: found
          • Article: not found

          Profile hidden Markov models.

          S. Eddy (1998)
          The recent literature on profile hidden Markov model (profile HMM) methods and software is reviewed. Profile HMMs turn a multiple sequence alignment into a position-specific scoring system suitable for searching databases for remotely homologous sequences. Profile HMM analyses complement standard pairwise comparison methods for large-scale sequence analysis. Several software implementations and two large libraries of profile HMMs of common protein domains are available. HMM methods performed comparably to threading methods in the CASP2 structure prediction exercise.
            Bookmark
            • Record: found
            • Abstract: found
            • Article: found
            Is Open Access

            Protein 3D Structure Computed from Evolutionary Sequence Variation

            The evolutionary trajectory of a protein through sequence space is constrained by its function. Collections of sequence homologs record the outcomes of millions of evolutionary experiments in which the protein evolves according to these constraints. Deciphering the evolutionary record held in these sequences and exploiting it for predictive and engineering purposes presents a formidable challenge. The potential benefit of solving this challenge is amplified by the advent of inexpensive high-throughput genomic sequencing. In this paper we ask whether we can infer evolutionary constraints from a set of sequence homologs of a protein. The challenge is to distinguish true co-evolution couplings from the noisy set of observed correlations. We address this challenge using a maximum entropy model of the protein sequence, constrained by the statistics of the multiple sequence alignment, to infer residue pair couplings. Surprisingly, we find that the strength of these inferred couplings is an excellent predictor of residue-residue proximity in folded structures. Indeed, the top-scoring residue couplings are sufficiently accurate and well-distributed to define the 3D protein fold with remarkable accuracy. We quantify this observation by computing, from sequence alone, all-atom 3D structures of fifteen test proteins from different fold classes, ranging in size from 50 to 260 residues., including a G-protein coupled receptor. These blinded inferences are de novo, i.e., they do not use homology modeling or sequence-similar fragments from known structures. The co-evolution signals provide sufficient information to determine accurate 3D protein structure to 2.7–4.8 Å Cα-RMSD error relative to the observed structure, over at least two-thirds of the protein (method called EVfold, details at http://EVfold.org). This discovery provides insight into essential interactions constraining protein evolution and will facilitate a comprehensive survey of the universe of protein structures, new strategies in protein and drug design, and the identification of functional genetic variants in normal and disease genomes.
              Bookmark
              • Record: found
              • Abstract: found
              • Article: not found

              PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments.

              The accurate prediction of residue-residue contacts, critical for maintaining the native fold of a protein, remains an open problem in the field of structural bioinformatics. Interest in this long-standing problem has increased recently with algorithmic improvements and the rapid growth in the sizes of sequence families. Progress could have major impacts in both structure and function prediction to name but two benefits. Sequence-based contact predictions are usually made by identifying correlated mutations within multiple sequence alignments (MSAs), most commonly through the information-theoretic approach of calculating mutual information between pairs of sites in proteins. These predictions are often inaccurate because the true covariation signal in the MSA is often masked by biases from many ancillary indirect-coupling or phylogenetic effects. Here we present a novel method, PSICOV, which introduces the use of sparse inverse covariance estimation to the problem of protein contact prediction. Our method builds on work which had previously demonstrated corrections for phylogenetic and entropic correlation noise and allows accurate discrimination of direct from indirectly coupled mutation correlations in the MSA. PSICOV displays a mean precision substantially better than the best performing normalized mutual information approach and Bayesian networks. For 118 out of 150 targets, the L/5 (i.e. top-L/5 predictions for a protein of length L) precision for long-range contacts (sequence separation >23) was ≥ 0.5, which represents an improvement sufficient to be of significant benefit in protein structure prediction or model quality assessment. The PSICOV source code can be downloaded from http://bioinf.cs.ucl.ac.uk/downloads/PSICOV.
                Bookmark

                Author and article information

                Contributors
                Role: Editor
                Journal
                PLoS One
                PLoS ONE
                plos
                plosone
                PLoS ONE
                Public Library of Science (San Francisco, USA )
                1932-6203
                2014
                24 March 2014
                : 9
                : 3
                : e92721
                Affiliations
                [1 ]Department of Applied Science and Technology and Center for Computational Sciences, Politecnico di Torino, Torino, Italy
                [2 ]Human Genetics Foundation-Torino, Torino, Italy
                [3 ]Sorbonne Universités, Université Pierre et Marie Curie Paris 06, UMR 7238, Computational and Quantitative Biology, Paris, France
                [4 ]Centre National de la Recherche Scientifique, UMR 7238, Computational and Quantitative Biology, Paris, France
                Technical University Darmstadt, Germany
                Author notes

                Competing Interests: The authors have declared that no competing interests exist.

                Conceived and designed the experiments: CB MZ MW A. Pagnani. Performed the experiments: CB MZ CF A. Procaccini A. Pagnani. Analyzed the data: CB CF A. Procaccini A. Pagnani. Contributed reagents/materials/analysis tools: CB MZ CF A. Procaccini MW A. Pagnani. Wrote the paper: CB MZ CF A. Procaccini RZ MW A. Pagnani. Code Implementation: CB MW A. Pagnani.

                Article
                PONE-D-13-51331
                10.1371/journal.pone.0092721
                3963956
                24663061
                407b2a65-0e83-400f-9a6c-1def67662c2f
                Copyright @ 2014

                This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                History
                : 5 December 2013
                : 24 February 2014
                Page count
                Pages: 12
                Funding
                Christoph Feinauer acknowledges funding from the People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme FP7/2007–2013/ under REA grant agreement No. 290038. Riccardo Zecchina and Carlo Baldassi acknowledge the European Research Council for grant No. 267915. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
                Categories
                Research Article
                Biology and Life Sciences
                Biochemistry
                Proteins
                Protein Structure
                Protein Folding
                Protein Structure Prediction
                Protein Interactions
                Proteomics
                Proteomic Databases
                Biophysics
                Computational Biology
                Evolutionary Modeling
                Genetics
                Genomics
                Functional Genomics
                Structural Genomics
                Molecular Biology
                Molecular Biology Techniques
                Sequencing Techniques
                Sequence Analysis
                Systems Biology
                Theoretical Biology
                Computer and Information Sciences
                Systems Science
                Complex Systems
                Physical Sciences
                Mathematics
                Applied Mathematics
                Algorithms
                Probability Theory
                Bayes Theorem
                Statistics (Mathematics)
                Biostatistics
                Physics
                Statistical Mechanics

                Uncategorized
                Uncategorized

                Comments

                Comment on this article