
      GPCR-BERT: Interpreting Sequential Design of G Protein-Coupled Receptors Using Protein Language Models

      research-article


          Abstract

With the rise of transformers and large language models (LLMs) in chemistry and biology, new avenues for the design and understanding of therapeutics have been opened up to the scientific community. Protein sequences can be modeled as language and can take advantage of recent advances in LLMs, especially given the abundance of available protein sequence data sets. In this letter, we developed the GPCR-BERT model for understanding the sequential design of G protein-coupled receptors (GPCRs). GPCRs are the target of over one-third of Food and Drug Administration-approved pharmaceuticals. However, there is a lack of comprehensive understanding of the relationships among amino acid sequence, ligand selectivity, and conformational motifs (such as NPxxY, CWxP, and E/DRY). By utilizing a pretrained protein language model (Prot-Bert) and fine-tuning it to predict variations in these motifs, we were able to shed light on several relationships between residues in the binding pocket and the conserved motifs. To achieve this, we interpreted the attention weights and hidden states of the model to quantify how much each amino acid contributes to determining the identity of the masked residues. The fine-tuned models demonstrated high accuracy in predicting hidden residues within the motifs. In addition, embedding analysis was performed over 3D structures to elucidate higher-order interactions within the conformations of the receptors.
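The masked-motif prediction task described in the abstract can be illustrated with a short, self-contained sketch: locate the conserved class-A motifs (NPxxY, CWxP, E/DRY) in a sequence and mask their residues, which is the kind of input/label pair a masked language model such as Prot-Bert would be fine-tuned on. The toy sequence and helper names below are illustrative, not taken from the paper:

```python
import re

# Conserved class-A GPCR motifs named in the abstract.
# "x" is a wildcard position; E/DRY means E or D followed by RY.
MOTIF_PATTERNS = {
    "NPxxY": r"NP..Y",
    "CWxP":  r"CW.P",
    "E/DRY": r"[ED]RY",
}

def mask_motifs(sequence, mask_token="[MASK]"):
    """Replace every residue inside a conserved motif with a mask token,
    returning the masked token list and the original residues as labels."""
    tokens = list(sequence)
    labels = {}
    for name, pattern in MOTIF_PATTERNS.items():
        for m in re.finditer(pattern, sequence):
            for i in range(m.start(), m.end()):
                labels[i] = sequence[i]
                tokens[i] = mask_token
    return tokens, labels

# Toy sequence containing a DRY and an NPxxY motif (illustrative only).
seq = "MGGTACDRYLAIVHANPVIYAL"
tokens, labels = mask_motifs(seq)
print("".join(t if t != "[MASK]" else "_" for t in tokens))
print(labels)
```

A fine-tuning loop would then train the model to recover `labels` from `tokens`, and the attention weights toward the masked positions indicate which residues most influence the prediction.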

          Related collections

          Most cited references70


          Highly accurate protein structure prediction with AlphaFold

Proteins are essential to life, and understanding their structure can facilitate a mechanistic understanding of their function. Through an enormous experimental effort, the structures of around 100,000 unique proteins have been determined, but this represents a small fraction of the billions of known protein sequences. Structural coverage is bottlenecked by the months to years of painstaking effort required to determine a single protein structure. Accurate computational approaches are needed to address this gap and to enable large-scale structural bioinformatics. Predicting the three-dimensional structure that a protein will adopt based solely on its amino acid sequence—the structure prediction component of the ‘protein folding problem’—has been an important open research problem for more than 50 years. Despite recent progress, existing methods fall far short of atomic accuracy, especially when no homologous structure is available. Here we provide the first computational method that can regularly predict protein structures with atomic accuracy even in cases in which no similar structure is known. We validated an entirely redesigned version of our neural network-based model, AlphaFold, in the challenging 14th Critical Assessment of protein Structure Prediction (CASP14), demonstrating accuracy competitive with experimental structures in a majority of cases and greatly outperforming other methods. Underpinning the latest version of AlphaFold is a novel machine learning approach that incorporates physical and biological knowledge about protein structure, leveraging multi-sequence alignments, into the design of the deep learning algorithm.

            Deep learning.

            Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.
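The backpropagation update described above can be reduced to a minimal, self-contained illustration; the model (a single linear neuron with a squared loss), the data point, and the learning rate below are made up for demonstration:

```python
# One gradient-descent step for the model y_hat = w*x + b with
# squared loss L = (y_hat - y)^2; gradients are computed by hand,
# which is what backpropagation automates layer by layer.

def train_step(w, b, x, y, lr=0.1):
    y_hat = w * x + b             # forward pass
    grad_w = 2 * (y_hat - y) * x  # dL/dw via the chain rule
    grad_b = 2 * (y_hat - y)      # dL/db
    return w - lr * grad_w, b - lr * grad_b

w, b = 0.0, 0.0
for _ in range(100):
    w, b = train_step(w, b, x=2.0, y=5.0)  # fit a single toy data point
print(w * 2.0 + b)  # the prediction converges toward y = 5.0
```

In a deep network, the same chain-rule computation is repeated backward through every layer, so each layer's parameters receive a gradient of the loss with respect to its own representation.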

              Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

              S Altschul (1997)
              The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSI-BLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used to uncover several new and interesting members of the BRCT superfamily.
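The position-specific score matrix at the heart of PSI-BLAST can be sketched in toy form: count residues at each column of an alignment, add pseudocounts, and take log-odds against a background distribution. The alignment, uniform background, and pseudocount below are illustrative simplifications, not the actual PSI-BLAST weighting scheme:

```python
from collections import Counter
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(ALPHABET)  # uniform background for simplicity

def build_pssm(alignment, pseudocount=1.0):
    """Return log-odds scores pssm[pos][residue] from gap-free
    aligned sequences of equal length."""
    length = len(alignment[0])
    pssm = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in alignment)
        total = len(alignment) + pseudocount * len(ALPHABET)
        scores = {
            aa: math.log(((counts[aa] + pseudocount) / total) / BACKGROUND)
            for aa in ALPHABET
        }
        pssm.append(scores)
    return pssm

# Tiny alignment of a conserved three-residue motif (made up).
pssm = build_pssm(["DRY", "ERY", "DRY", "DRF"])
print(pssm[1]["R"] > pssm[1]["W"])  # conserved R outscores unseen W
```

Searching with such a matrix instead of a single query sequence is what lets each PSI-BLAST iteration pick up weaker but biologically relevant homologs.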

                Author and article information

                Journal
                J Chem Inf Model
                Journal of Chemical Information and Modeling
                American Chemical Society
ISSN (print): 1549-9596
ISSN (electronic): 1549-960X
                10 February 2024
                26 February 2024
Volume: 64
Issue: 4
Pages: 1134-1144
                Affiliations
Department of Chemical Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, United States
Department of Mechanical Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, United States
Department of Biomedical Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, United States
Machine Learning Department, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, United States
                Author information
                https://orcid.org/0009-0007-7092-5497
                https://orcid.org/0000-0002-4711-9012
                https://orcid.org/0000-0001-6216-0518
                https://orcid.org/0000-0002-2952-8576
                Article
DOI: 10.1021/acs.jcim.3c01706
                10900288
                38340054
                © 2024 The Authors. Published by American Chemical Society

                Permits the broadest form of re-use including for commercial purposes, provided that author attribution and integrity are maintained ( https://creativecommons.org/licenses/by/4.0/).

                History
22 October 2023
29 January 2024
29 January 2024
                Funding
                Funded by: Carnegie Mellon University, doi 10.13039/100008047;
                Award ID: NA
                Funded by: Center for Machine Learning and Health, School of Computer Science, Carnegie Mellon University, doi 10.13039/100018489;
                Award ID: NA
                Categories
                Article
                Custom metadata
                ci3c01706

                Computational chemistry & Modeling
