DeepECA: an end-to-end learning framework for protein contact prediction from a multiple sequence alignment

Abstract

Background

Recently developed methods of protein contact prediction, a crucially important step for protein structure prediction, depend heavily on deep neural networks (DNNs) and multiple sequence alignments (MSAs) of target proteins. Protein sequences are accumulating so rapidly that abundant sequences for constructing an MSA of a target protein are often readily obtainable. Nevertheless, the number of sequences that can be included in an MSA used for contact prediction varies widely, from very many to very few. Abundant sequences might degrade prediction results, while room for improvement remains when only a limited number of sequences is available to construct an MSA. To resolve these persistent issues, we strove to develop a novel framework using DNNs in an end-to-end manner for contact prediction.

Results

We developed neural network models to improve the precision of contact prediction from both deep and shallow MSAs. Results show that assigning weights to sequences in a deep MSA yielded higher prediction accuracy. For shallow MSAs, adding a few sequential features increased the prediction accuracy of long-range contacts in our model. Building on these models, we extended our approach to a multi-task model that achieves higher accuracy by also predicting secondary structures and solvent-accessible surface areas. We also demonstrated that ensemble averaging of our models can raise accuracy. Using past CASP target protein domains, we tested our models and demonstrated that our final model is superior to or equivalent to existing meta-predictors.
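The idea of assigning weights to sequences in an MSA can be illustrated with a small NumPy sketch. It is not the DeepECA implementation: the function name `weighted_covariance`, the integer encoding of residues, and the choice of a covariance matrix as the pairwise statistic are assumptions made for illustration, and the weights are passed in by hand, whereas the abstract describes them as assigned by the model.

```python
import numpy as np

def weighted_covariance(msa, weights, n_states=21):
    """Weighted site frequencies and pair covariances from an MSA.

    msa      : (N, L) integer array, one row per aligned sequence
               (hypothetical encoding: 0..n_states-1, e.g. 20 amino acids + gap).
    weights  : (N,) non-negative per-sequence weights (assumed given here).
    Returns f1 with shape (L, A) and cov with shape (L, A, L, A).
    """
    msa = np.asarray(msa)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                          # normalize weights to sum to 1
    one_hot = np.eye(n_states)[msa]          # (N, L, A) one-hot encoding
    # Weighted single-site frequencies f_i(a).
    f1 = np.einsum("n,nla->la", w, one_hot)
    # Weighted pair frequencies f_ij(a, b).
    f2 = np.einsum("n,nia,njb->iajb", w, one_hot, one_hot)
    # Covariance C_ij(a, b) = f_ij(a, b) - f_i(a) * f_j(b).
    cov = f2 - np.einsum("ia,jb->iajb", f1, f1)
    return f1, cov

# Toy alignment: 3 sequences, 2 columns, 2 residue states.
toy_msa = [[0, 1], [0, 1], [1, 0]]
f1_uniform, cov_uniform = weighted_covariance(toy_msa, [1, 1, 1], n_states=2)
# Setting the third weight to zero removes that sequence's contribution.
f1_down, cov_down = weighted_covariance(toy_msa, [1, 1, 0], n_states=2)
```

In this toy case, down-weighting the third sequence to zero leaves two identical sequences, so all covariances vanish. This mirrors, in miniature, how per-sequence weights can change the pairwise statistics a contact predictor sees.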

Conclusions

The end-to-end learning framework we built can use information derived from either deep or shallow MSAs for contact prediction. Recently, an increasing number of protein sequences have become accessible, including metagenomic sequences, which might degrade contact prediction results. Under such circumstances, our model can provide a means to reduce noise automatically. Tertiary structure prediction based on contacts and secondary structures predicted by our model, starting from the MSA of a target protein, yields more accurate three-dimensional models than those obtained from existing evolutionary coupling analysis (ECA) methods. DeepECA is available from https://github.com/tomiilab/DeepECA.
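As a rough illustration of how predicted contacts might be prepared for downstream structure modeling, the sketch below ranks long-range residue pairs from a predicted contact probability map. It is not part of DeepECA: the function name `top_long_range_contacts` and its parameters are hypothetical. The default sequence separation of 24 residues follows the usual CASP definition of long-range contacts, and taking the top L pairs for a protein of length L is a common evaluation convention rather than something stated in the abstract.

```python
import numpy as np

def top_long_range_contacts(prob, min_sep=24, top_k=None):
    """Rank residue pairs (i, j) with j - i >= min_sep by contact probability.

    prob  : (L, L) array of predicted contact probabilities.
    top_k : number of pairs to return (defaults to L, the sequence length).
    Returns a list of (i, j, probability) tuples, highest probability first.
    """
    prob = np.asarray(prob, dtype=float)
    length = prob.shape[0]
    sym = (prob + prob.T) / 2.0                # enforce symmetry
    i, j = np.triu_indices(length, k=min_sep)  # pairs with j - i >= min_sep
    order = np.argsort(-sym[i, j], kind="stable")
    if top_k is None:
        top_k = length
    order = order[:top_k]
    return [(int(i[n]), int(j[n]), float(sym[i[n], j[n]])) for n in order]

# Toy map for a 30-residue protein with two confident long-range pairs
# and one short-range pair that should be ignored.
toy = np.zeros((30, 30))
toy[0, 25] = toy[25, 0] = 0.9
toy[2, 29] = toy[29, 2] = 0.8
toy[0, 5] = toy[5, 0] = 0.99
ranked = top_long_range_contacts(toy, top_k=2)
```

A ranked list like this could then be passed as distance restraints to a structure modeling tool of the reader's choice.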

Author and article information

Contributors

Kentaro Tomii:

ORCID: http://orcid.org/0000-0002-4567-4768

k-tomii@aist.go.jp

Journal

Journal ID (nlm-ta): BMC Bioinformatics

Journal ID (iso-abbrev): BMC Bioinformatics

Title: BMC Bioinformatics

Publisher: BioMed Central (London)

ISSN (Electronic): 1471-2105

Publication date (Electronic): 9 January 2020

Publication date PMC-release: 9 January 2020

Publication date Collection: 2020

Volume: 21

Electronic Location Identifier: 10

Affiliations

[1] ISNI 0000 0001 2151 536X, GRID grid.26999.3d, Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa-shi, Chiba-ken, 277-8562 Japan

[2] ISNI 0000 0001 2230 7538, GRID grid.208504.b, Artificial Intelligence Research Center (AIRC), Biotechnology Research Institute for Drug Discovery, Real World Big-Data Computation Open Innovation Laboratory (RWBC-OIL), National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-ku, Tokyo, 135-0064 Japan

Author information

Kentaro Tomii http://orcid.org/0000-0002-4567-4768

Article

Publisher ID: 3190

DOI: 10.1186/s12859-019-3190-x

PMC ID: 6953294

PubMed ID: 31918654

SO-VID: bd9d03a6-5a97-4993-931d-8e78e04c73b1

License:

Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

History

Date received: 16 July 2019

Date accepted: 4 November 2019

Funding

Funded by: FundRef http://dx.doi.org/10.13039/100009619, Japan Agency for Medical Research and Development

Award ID: JP19am0101110

Award Recipient: Kentaro Tomii

Custom metadata

ScienceOpen disciplines: Bioinformatics & Computational biology

Keywords: contact prediction,convolutional neural network,deep learning,multiple sequence alignment,protein,secondary structure prediction
