ProteinBERT: a universal deep-learning model of protein sequence and function

There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

Abstract

Summary

Self-supervised deep language modeling has shown unprecedented success across natural language tasks, and has recently been repurposed to biological sequences. However, existing models and pretraining methods are designed and optimized for text analysis. We introduce ProteinBERT, a deep language model specifically designed for proteins. Our pretraining scheme combines language modeling with a novel task of Gene Ontology (GO) annotation prediction. We introduce novel architectural elements that make the model highly efficient and flexible to long sequences. The architecture of ProteinBERT consists of both local and global representations, allowing end-to-end processing of these types of inputs and outputs. ProteinBERT obtains near state-of-the-art performance, and sometimes exceeds it, on multiple benchmarks covering diverse protein properties (including protein structure, post-translational modifications and biophysical attributes), despite using a far smaller and faster model than competing deep-learning methods. Overall, ProteinBERT provides an efficient framework for rapidly training protein predictors, even with limited labeled data.

Availability and implementation

Code and pretrained model weights are available at https://github.com/nadavbra/protein_bert.

Supplementary information

Supplementary data are available at Bioinformatics online.

Related collections

Most cited references 35

Record: found
Abstract: found
Article: not found

Gene Ontology: tool for the unification of biology

Michael Ashburner, Catherine A. Ball, Judith Blake … (2002)

Genomic sequencing has made it clear that a large fraction of the genes specifying the core biological functions are shared by all eukaryotes. Knowledge of the biological role of such shared proteins in one organism can often be transferred to other organisms. The goal of the Gene Ontology Consortium is to produce a dynamic, controlled vocabulary that can be applied to all eukaryotes even as knowledge of gene and protein roles in cells is accumulating and changing. To this end, three independent ontologies accessible on the World-Wide Web (http://www.geneontology.org) are being constructed: biological process, molecular function and cellular component.

0 comments Cited 15445 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: not found

Basic local alignment search tool.

Stephen F Altschul, Warren Gish, Webb Miller … (1990)

A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score. Recent mathematical results on the stochastic properties of MSP scores allow an analysis of the performance of this method as well as the statistical significance of alignments it generates. The basic algorithm is simple and robust; it can be implemented in a number of ways and applied in a variety of contexts including straightforward DNA and protein sequence database searches, motif searches, gene identification searches, and in the analysis of multiple regions of similarity in long DNA sequences. In addition to its flexibility and tractability to mathematical analysis, BLAST is an order of magnitude faster than existing sequence comparison tools of comparable sensitivity.

0 comments Cited 8986 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: not found

Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

S Altschul (1997)

The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSI-BLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used to uncover several new and interesting members of the BRCT superfamily.

0 comments Cited 4133 times – based on 0 reviews      Review now

Bookmark

All references

Author and article information

Contributors

Nadav Brandes: (View ORCID Profile)

Nadav Rappoport: (View ORCID Profile)

Michal Linial: (View ORCID Profile)

Journal

Title: Bioinformatics

Publisher: Oxford University Press (OUP)

ISSN (Print): 1367-4803

ISSN (Electronic): 1460-2059

Publication date Created: April 15 2022

Publication date Created: April 12 2022

Publication date Created: February 10 2022

Publication date Other: April 15 2022

Publication date (Print): April 12 2022

Publication date (Electronic): February 10 2022

Volume: 38

Issue: 8

Pages: 2102-2110

Affiliations

[1 ]School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem 9190401, Israel

[2 ]Department of Biological Chemistry, The Alexander Silberman Institute of Life Sciences, The Hebrew University of Jerusalem, Jerusalem 9190401, Israel

[3 ]Deep Trading Ltd., Haifa 3508401, Israel

[4 ]Department of Software and Information Systems Engineering, Faculty of Engineering Sciences, Ben-Gurion University of the Negev, Beer Sheva 8410501, Israel

Article

DOI: 10.1093/bioinformatics/btac020

SO-VID: fa790ab8-5286-43d1-8ea7-de034b47a084

License:

https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model

ProteinBERT: a universal deep-learning model of protein sequence and function

Read this article at

Abstract

Summary

Availability and implementation

Supplementary information

Related collections

Universal stem cells

Most cited references 35

Gene Ontology: tool for the unification of biology

Basic local alignment search tool.

Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Author and article information

Contributors

Journal

Affiliations

Article

History

Comments

Comment on this article

Similar content 146

Cited by 78

Most referenced authors 2,427