ProteinBERT: a universal deep-learning model of protein sequence and function.

There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

Abstract

Self-supervised deep language modeling has shown unprecedented success across natural language tasks, and has recently been repurposed to biological sequences. However, existing models and pretraining methods are designed and optimized for text analysis. We introduce ProteinBERT, a deep language model specifically designed for proteins. Our pretraining scheme combines language modeling with a novel task of Gene Ontology (GO) annotation prediction. We introduce novel architectural elements that make the model highly efficient and flexible to long sequences. The architecture of ProteinBERT consists of both local and global representations, allowing end-to-end processing of these types of inputs and outputs. ProteinBERT obtains near state-of-the-art performance, and sometimes exceeds it, on multiple benchmarks covering diverse protein properties (including protein structure, post-translational modifications and biophysical attributes), despite using a far smaller and faster model than competing deep-learning methods. Overall, ProteinBERT provides an efficient framework for rapidly training protein predictors, even with limited labeled data.

Related collections

Most cited references 35

Record: found
Abstract: found
Article: not found

Gene Ontology: tool for the unification of biology

Michael Ashburner, Catherine A. Ball, Judith Blake … (2002)

Genomic sequencing has made it clear that a large fraction of the genes specifying the core biological functions are shared by all eukaryotes. Knowledge of the biological role of such shared proteins in one organism can often be transferred to other organisms. The goal of the Gene Ontology Consortium is to produce a dynamic, controlled vocabulary that can be applied to all eukaryotes even as knowledge of gene and protein roles in cells is accumulating and changing. To this end, three independent ontologies accessible on the World-Wide Web (http://www.geneontology.org) are being constructed: biological process, molecular function and cellular component.

0 comments Cited 15445 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: not found

Basic local alignment search tool.

Stephen F Altschul, Warren Gish, Webb Miller … (1990)

A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score. Recent mathematical results on the stochastic properties of MSP scores allow an analysis of the performance of this method as well as the statistical significance of alignments it generates. The basic algorithm is simple and robust; it can be implemented in a number of ways and applied in a variety of contexts including straightforward DNA and protein sequence database searches, motif searches, gene identification searches, and in the analysis of multiple regions of similarity in long DNA sequences. In addition to its flexibility and tractability to mathematical analysis, BLAST is an order of magnitude faster than existing sequence comparison tools of comparable sensitivity.

0 comments Cited 8986 times – based on 0 reviews      Review now

Bookmark

Record: found
Abstract: found
Article: not found

Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

S Altschul (1997)

The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSI-BLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used to uncover several new and interesting members of the BRCT superfamily.

0 comments Cited 4133 times – based on 0 reviews      Review now

Bookmark

All references

Author and article information

Journal

Journal ID (iso-abbrev): Bioinformatics

Title: Bioinformatics (Oxford, England)

Publisher: Oxford University Press (OUP)

ISSN (Electronic): 1367-4811

ISSN (Print): 1367-4803

Publication date (Electronic): Apr 12 2022

Volume: 38

Issue: 8

Affiliations

[1 ] School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem 9190401, Israel.

[2 ] Department of Biological Chemistry, The Alexander Silberman Institute of Life Sciences, The Hebrew University of Jerusalem, Jerusalem 9190401, Israel.

[3 ] Deep Trading Ltd., Haifa 3508401, Israel.

[4 ] Department of Software and Information Systems Engineering, Faculty of Engineering Sciences, Ben-Gurion University of the Negev, Beer Sheva 8410501, Israel.

Article

Publisher Item ID: 6502274

DOI: 10.1093/bioinformatics/btac020

PMC ID: 9386727

PubMed ID: 35020807

SO-VID: fa790ab8-5286-43d1-8ea7-de034b47a084

History

Data availability:

Comments

Comment on this article

scite_

Cited by 78

See all cited by

Most referenced authors 2,427

See all reference authors

- Version 1
- Version 1

ProteinBERT: a universal deep-learning model of protein sequence and function.

Read this article at

Abstract

Related collections

Universal stem cells

Most cited references 35

Gene Ontology: tool for the unification of biology

Basic local alignment search tool.

Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Author and article information

Journal

Affiliations

Article

History

Comments

Comment on this article

Similar content 94

Cited by 78

Most referenced authors 2,427