Direct-coupling analysis of residue coevolution captures native contacts across many protein families

Abstract

The similarity in the three-dimensional structures of homologous proteins imposes strong constraints on their sequence variability. It has long been suggested that the resulting correlations among amino acid compositions at different sequence positions can be exploited to infer spatial contacts within the tertiary protein structure. Crucial to this inference is the ability to disentangle direct and indirect correlations, as accomplished by the recently introduced direct-coupling analysis (DCA). Here we develop a computationally efficient implementation of DCA, which allows us to evaluate the accuracy of contact prediction by DCA for a large number of protein domains, based purely on sequence information. DCA is shown to yield a large number of correctly predicted contacts, recapitulating the global structure of the contact map for the majority of the protein domains examined. Furthermore, our analysis captures clear signals beyond intradomain residue contacts, arising, e.g., from alternative protein conformations, ligand-mediated residue couplings, and interdomain interactions in protein oligomers. Our findings suggest that contacts predicted by DCA can be used as a reliable guide to facilitate computational predictions of alternative protein conformations, protein complex formation, and even the de novo prediction of protein domain structures, contingent on the existence of a large number of homologous sequences which are being rapidly made available due to advances in genome sequencing.

AUTHOR SUMMARY

In this work, we have shown the ability of mfDCA to identify, with high accuracy, contacting residue pairs in domain families, and found that mfDCA is able to identify not only intradomain contacts but also interdomain residue pairs that are part of oligomerization interfaces. One potential application is the identification of interaction interfaces for homodimers, which could ultimately help in predicting complex structures. Our results might open unexplored avenues of research in which full contact maps could be estimated and used as input data for de novo protein structure identification, which is particularly interesting in the case of interdomain contacts in multidomain proteins. Ultimately, this methodology can be used with pairs of proteins rather than single proteins to identify potential protein–protein interactions.

The large number of contacts correctly predicted by mfDCA prompted us to explore the extent to which our method may be used to predict the contact maps of protein domains. In Fig. P1C, we show the example of the eukaryotic H-Ras protein (5). The first approximately 800 mfDCA contact predictions are displayed; they include 320 nontrivial pairs, which are separated by at least five positions along the protein backbone. The predicted contact map captures the native contacts very well: even though not every individual contact is predicted, almost all contact clusters are identified.

One of the most striking features of our results obtained using DCA is how gradually the average true-positive (TP) rate declines as one moves down the ranking of residue pairs by direct information. This observation naturally led us to explore the information lying beyond the first 10–30 DCA predictions. For instance, how many residue contacts can be predicted if a TP rate of at least 70% is required? As can be seen in Fig. P1B, the average TP rate drops to this level only after the top 70 DCA predictions, at which point one would have obtained approximately 50 true contacts on average. In extreme cases (e.g., the family of bacterial receptors called the tripartite tricarboxylate receptors), a TP rate of >70% was present for the first 600 DCA-predicted residue pairs, therefore including more than 400 native contacts.
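The depth at which a required TP rate is still met can be read off a cumulative TP curve. The following minimal sketch assumes a ranked list of booleans (True for a native contact); the function names and the toy ranking are hypothetical and are not data from the study.

```python
def cumulative_tp_rates(is_contact):
    """Cumulative true-positive rate after each rank of a ranked prediction list."""
    rates = []
    hits = 0
    for rank, flag in enumerate(is_contact, start=1):
        hits += bool(flag)
        rates.append(hits / rank)
    return rates


def deepest_rank_at_threshold(is_contact, threshold=0.7):
    """Largest rank whose cumulative TP rate is still >= threshold (0 if none)."""
    best = 0
    for rank, rate in enumerate(cumulative_tp_rates(is_contact), start=1):
        if rate >= threshold:
            best = rank
    return best


# Hypothetical ranking: 30 correct predictions, then 40 alternating
# correct/incorrect predictions, then 30 incorrect ones.
ranked = [True] * 30 + [True, False] * 20 + [False] * 30
deepest = deepest_rank_at_threshold(ranked, threshold=0.7)
contacts_found = sum(ranked[:deepest])
print(deepest, contacts_found)
```

The figures quoted above are consistent with this arithmetic: a 70% TP rate over 70 predictions corresponds to 0.7 × 70 = 49 true contacts, and over 600 predictions to 0.7 × 600 = 420.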

A third cause of high-scoring long-distance pairings inferred by mfDCA is observed in the metalloenzyme domain family, whose members bind metal ions. It has a high-scoring residue pair whose two residues are separated by more than 14 Å when mapped to the glutathione transferase FosA of Pseudomonas aeruginosa. We observed, however, that when the two residues of this pair are taken from different subunits, both are in direct contact with the manganese ion, Mn(II), in the paired (dimer) configuration of the enzyme. Thus, the “direct interaction” between these residues found by mfDCA is presumably mediated through their common interaction with a third agent, the metal ion in this case.

A second cause of high-scoring long-distance pairs is the occurrence of alternative domain conformations. As an illustration, we examined the domain family GerE, whose members include the DNA-binding domains of a number of proteins (response regulators) in a system known as the two-component signaling system. Using the DNA-bound domain of the nitrate/nitrite response regulator NarL of E. coli, we found that all of the top 20 DCA predictions were true contacts. However, when mapping the same pairs to the structure of the inactive, full-length response regulator DosR, a transcriptional regulatory protein of Mycobacterium tuberculosis, seven pairs were found at distances >8 Å. A comparison between these structures shows clearly that all of the long-distance pairs involve one particular position that is significantly displaced in the full-length structure, presumably due to interaction with the (unphosphorylated) regulatory domain.

First, direct correlation may be due to contacts between proteins in oligomeric complexes (i.e., complexes comprising several protein subunits). One example is the ATPase domain of the nitrogen regulatory protein C family transcriptional activators, which act to promote the activity of RNA polymerase. Upon activation, different subunits of this domain are known to form a multiunit ring, wrapping DNA around the complex (4). Among the top 20 mfDCA-predicted contacts, three pairs appeared to be long-distance (>10 Å) within the domain, but they all turned out to be within 5 Å of the closest position in an adjacent subunit of the ring structure. More globally, out of the 131 studied domain families, 21 families feature X-ray crystal structures involving oligomers. In these families, about one-half of the top 20 pairs that are not intradomain contacts (i.e., contacts within the same domain) turned out to be interdomain contacts (i.e., contacts between domains).
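The reassignment of apparent false positives described here amounts to checking each predicted pair against two distances: one within a single subunit and one to the closest copy of the partner residue in an adjacent subunit. A minimal sketch follows, assuming both distances are already available in Å; the function name, the 8 Å default cutoff, and the example distances are illustrative assumptions.

```python
def classify_pair(intra_distance, inter_distance=None, cutoff=8.0):
    """Label a predicted residue pair from its intra- and inter-subunit distances (in Å).

    inter_distance is None when no oligomeric structure is available.
    """
    if intra_distance < cutoff:
        return "intradomain contact"
    if inter_distance is not None and inter_distance < cutoff:
        return "interdomain contact"
    return "no contact in this structure"


# Hypothetical distances for three predicted pairs.
examples = {
    "pair A": classify_pair(5.2, 30.1),   # close within one subunit
    "pair B": classify_pair(12.4, 4.6),   # far within the subunit, close across subunits
    "pair C": classify_pair(15.0),        # far, and no oligomer structure to check
}
for name, label in examples.items():
    print(name, label)
```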

The high accuracy of mfDCA predictions led us to examine whether the small fraction of “false positives” (i.e., predicted pairings located far away in the crystal structures) might also have some biological significance. We investigated various biological reasons for the appearance of such long-distance direct correlations and found the following cases.

To test the general applicability of mfDCA, we analyzed 131 predominantly bacterial domain families. These families were selected (i) to be sufficiently abundant for statistical analysis (>1,000 nonredundant sequences), and (ii) to feature at least two high-resolution X-ray crystal structures (resolution <3 Å), so that the spatial proximity between predicted residue pairs could be evaluated. The selected domain families encompassed a total of 856 different solved structures listed in the Protein Data Bank. On average, 84% of the top 20 mfDCA predictions (with a minimum separation of five positions along the sequence) in each of the 131 families are TP contacts, whereas simple correlation analysis using mutual information reaches only a 59% TP rate (Fig. P1B). The individual TP rates of mfDCA predictions for the 131 families are distributed mostly in the range of 70–100%. Analogous results were also observed for a collection of 25 abundant eukaryotic proteins.
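For reference, the mutual-information baseline mentioned above scores a pair of alignment columns by how far their joint amino acid frequencies deviate from the product of the single-column frequencies. The sketch below is a plain textbook version without sequence reweighting or pseudocounts; the function name and the toy alignment are hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(col_i, col_j):
    """Mutual information (in nats) between two alignment columns of equal length."""
    if len(col_i) != len(col_j) or not col_i:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(col_i)
    f_i = Counter(col_i)
    f_j = Counter(col_j)
    f_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in f_ij.items():
        p_ab = count / n
        mi += p_ab * log(p_ab / ((f_i[a] / n) * (f_j[b] / n)))
    return mi


# Toy alignment (hypothetical sequences, one string per homolog).
msa = ["AKLE", "AKLE", "GRLD", "GRLD", "AKVE", "GRVD"]
columns = list(zip(*msa))
mi_01 = mutual_information(columns[0], columns[1])  # columns 0 and 1 covary perfectly
mi_02 = mutual_information(columns[0], columns[2])  # columns 0 and 2 are independent
print(round(mi_01, 3), round(mi_02, 3))
```

Because mutual information is computed pair by pair, it cannot tell a direct coupling from a correlation transmitted through intermediate positions, which is the limitation DCA is designed to address.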

The performance of mfDCA is first illustrated in Fig. P1A for a specific protein structure (domain) taken from the proteome (i.e., the assembly of proteins present in an organism under specific conditions) of the bacterium Escherichia coli (3). The top 20 DCA predictions, separated by at least five positions along the sequence to exclude trivial predictions, are displayed as bonds of different colors: the 19 residue pairs with distances <8 Å are shown in red, and the lone pair with a larger interresidue distance is shown in green. mfDCA thus gave a true-positive (TP) rate of 95% for the top 20 contacts predicted for this protein.
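The evaluation described here can be sketched in a few lines: rank residue pairs by score, discard pairs closer than five positions along the sequence, keep the top 20, and count those closer than 8 Å in the structure. The function names, the dictionary layout, and the toy scores and distances below are assumptions made for illustration only.

```python
def top_predictions(scores, n_top=20, min_separation=5):
    """Return the n_top highest-scoring pairs (i, j), i < j, with j - i >= min_separation."""
    eligible = [(pair, s) for pair, s in scores.items()
                if pair[1] - pair[0] >= min_separation]
    eligible.sort(key=lambda item: item[1], reverse=True)
    return [pair for pair, _ in eligible[:n_top]]


def true_positive_rate(pairs, distances, cutoff=8.0):
    """Fraction of predicted pairs whose structural distance (in Å) is below cutoff."""
    if not pairs:
        raise ValueError("no predicted pairs to evaluate")
    hits = sum(1 for pair in pairs if distances[pair] < cutoff)
    return hits / len(pairs)


# Hypothetical scores: 30 pairs separated by six positions, with decreasing score,
# plus one high-scoring but trivial neighbor pair that must be excluded.
scores = {(i, i + 6): 1.0 - 0.01 * i for i in range(30)}
scores[(0, 2)] = 5.0
# Hypothetical distances: every pair is in contact except one long-distance pair.
distances = {pair: 6.0 for pair in scores}
distances[(7, 13)] = 14.0

top20 = top_predictions(scores)
tp_rate = true_positive_rate(top20, distances)
print(len(top20), tp_rate)
```

With 19 of the 20 retained pairs below the cutoff, the toy data reproduce the 95% TP rate pattern of the example above.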

Due to rapid advances in DNA sequencing technologies, a large number of genome projects, which aim to determine the complete DNA (genome) sequence of an organism, are now in existence. This is particularly the case for bacterial genomes, with approximately 1,700 completed and 8,300 ongoing (1). These genome sequences can be used to provide correlated substitution patterns for a large number of common bacterial proteins and interacting protein pairs, as well as for some abundant eukaryotic, or nonbacterial, proteins. The availability of such correlations presents opportunities to make inferences on residue contacts, that is, contacts between amino acid residues within or between proteins, and to eventually predict protein structures at various levels of organization based on sequence information alone. Crucial to this inference is the ability to disentangle direct correlations from indirect ones (i.e., those induced through intermediate residue positions), as accomplished by the recently introduced method of direct-coupling analysis (DCA) (2), based on sophisticated tools developed in statistical physics. Here, we further develop a computationally efficient implementation of DCA, called mfDCA, an algorithm based on the mean-field approximation of DCA, which is 10³ to 10⁴ times faster than the original DCA implementation, and hence can be used to rapidly analyze large numbers of families of long protein sequences.
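The mean-field idea behind mfDCA can be illustrated with a minimal sketch. It assumes the standard mean-field treatment of a Potts model, in which couplings are estimated as the negative inverse of the connected correlation matrix of the alignment. The function name `mean_field_scores`, the pseudocount weight of 0.5, the four-letter toy alphabet, and the Frobenius-norm pair score are illustrative assumptions rather than details given in this summary; in particular, the Frobenius norm stands in for the direct-information score, and sequence reweighting is omitted.

```python
import numpy as np


def mean_field_scores(msa, q, pseudocount_weight=0.5):
    """Pair scores from a mean-field inversion of the connected correlation matrix.

    msa: (M, L) integer array with entries in 0..q-1. Returns an (L, L) score matrix.
    """
    msa = np.asarray(msa)
    M, L = msa.shape
    onehot = np.eye(q)[msa]                                  # shape (M, L, q)
    fi = onehot.mean(axis=0)                                 # single-site frequencies
    fij = np.einsum("mia,mjb->iajb", onehot, onehot) / M     # pair frequencies
    # Pseudocount regularization (assumed form) keeps the matrix invertible.
    w = pseudocount_weight
    fi = (1 - w) * fi + w / q
    fij = (1 - w) * fij + w / q**2
    for i in range(L):
        fij[i, :, i, :] = np.diag(fi[i])                     # consistent diagonal blocks
    # Connected correlations; the last state is dropped to fix the gauge.
    C = fij - fi[:, :, None, None] * fi[None, None, :, :]
    C = C[:, :q - 1, :, :q - 1].reshape(L * (q - 1), L * (q - 1))
    couplings = -np.linalg.inv(C).reshape(L, q - 1, L, q - 1)
    scores = np.sqrt((couplings**2).sum(axis=(1, 3)))        # Frobenius norm per pair
    np.fill_diagonal(scores, 0.0)
    return scores


# Toy alignment (synthetic): positions 0 and 3 are coupled, the rest are random.
rng = np.random.default_rng(0)
M, L, q = 2000, 6, 4
toy = rng.integers(0, q, size=(M, L))
copy = rng.random(M) < 0.9
toy[copy, 3] = toy[copy, 0]

scores = mean_field_scores(toy, q)
i, j = np.unravel_index(np.argmax(scores), scores.shape)
print(sorted((int(i), int(j))))
```

The only expensive step is a single inversion of an L(q−1) × L(q−1) matrix, which is the kind of saving a mean-field approach offers over iterative inference; the speed-up quoted above refers to the authors' implementation, not to this sketch.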

Protein function depends crucially on the correct folding of its three-dimensional structure. The maintenance of this structure during evolution imposes strong constraints on the variability of the amino acid sequences between homologous (i.e., closely related) proteins. For example, deleterious alterations (or mutations) in one residue may have to be compensated by mutations of the residues it interacts with, leading to correlations between the amino acid compositions at different sequence positions. In essence, a particular change at one position may correlate with certain necessary changes at other positions that enable the protein to maintain its structure and function.

Author and article information

Journal

Title: Proceedings of the National Academy of Sciences

Abbreviated Title: Proc. Natl. Acad. Sci. U.S.A.

Publisher: Proceedings of the National Academy of Sciences

ISSN (Print): 0027-8424

ISSN (Electronic): 1091-6490

Publication date Created: December 06 2011

Publication date (Electronic): November 21 2011

Publication date (Print): December 06 2011

Volume: 108

Issue: 49

Affiliations

[1] Center for Theoretical Biological Physics, University of California at San Diego, La Jolla, CA 92093-0374

[2] Human Genetics Foundation, Via Nizza 52, 10126 Turin, Italy

[3] Institute for Scientific Interchange, Viale Settimio Severo 65, 10133 Turin, Italy

[4] Department of Systems Biology, Harvard Medical School, 20 Longwood Avenue, Boston, MA 02115

[5] Memorial Sloan–Kettering Cancer Center, Computational Biology Center, 1275 York Avenue, New York, NY 10065

[6] Center for Computational Studies and Dipartimento di Fisica, Politecnico di Torino, Corso Duca degli Abruzzi 24, 10129 Turin, Italy

[7] Center for Theoretical Biological Physics, Rice University, Houston, TX 77005-1827

[8] Laboratoire de Génomique des Microorganismes, Unité Mixte de Recherche 7238, Université Pierre et Marie Curie, 15 rue de l’École de Médecine, 75006 Paris, France

Article

DOI: 10.1073/pnas.1111471108

PMC ID: 3241805

PubMed ID: 22106262

SO-VID: 23179eef-e489-47ba-b060-2fed83aa159a
