      BioBERT: a pre-trained biomedical language representation model for biomedical text mining

Research article

          Abstract

          Motivation

          Biomedical text mining is becoming increasingly important as the number of biomedical documents rapidly grows. With the progress in natural language processing (NLP), extracting valuable information from biomedical literature has gained popularity among researchers, and deep learning has boosted the development of effective biomedical text mining models. However, directly applying the advancements in NLP to biomedical text mining often yields unsatisfactory results due to a word distribution shift from general domain corpora to biomedical corpora. In this article, we investigate how the recently introduced pre-trained language model BERT can be adapted for biomedical corpora.

          Results

          We introduce BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining), which is a domain-specific language representation model pre-trained on large-scale biomedical corpora. With almost the same architecture across tasks, BioBERT largely outperforms BERT and previous state-of-the-art models in a variety of biomedical text mining tasks when pre-trained on biomedical corpora. While BERT obtains performance comparable to that of previous state-of-the-art models, BioBERT significantly outperforms them on the following three representative biomedical text mining tasks: biomedical named entity recognition (0.62% F1 score improvement), biomedical relation extraction (2.80% F1 score improvement) and biomedical question answering (12.24% MRR improvement). Our analysis results show that pre-training BERT on biomedical corpora helps it to understand complex biomedical texts.
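
The recipe summarized above has two stages: continue the masked-language-model pre-training of general-domain BERT on biomedical text, then fine-tune the resulting encoder on each downstream task. The snippet below is a minimal sketch of the first stage using the Hugging Face transformers and datasets libraries rather than the authors' original TensorFlow pipeline; the corpus file name, hyperparameters, and output directory are illustrative assumptions.

```python
# Sketch of BioBERT-style domain-adaptive pre-training: initialize from
# general-domain BERT and continue masked-language-model training on
# biomedical text (the paper uses PubMed abstracts and PMC full texts).
# The file name and hyperparameters below are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

# "pubmed_abstracts.txt" (hypothetical path): one abstract per line.
corpus = load_dataset("text", data_files="pubmed_abstracts.txt")["train"]
corpus = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="biobert-sketch", num_train_epochs=1),
    train_dataset=corpus,
    # Randomly masks 15% of tokens, matching BERT's MLM objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer,
                                                  mlm_probability=0.15),
)
trainer.train()
```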

          Availability and implementation

          We make the pre-trained weights of BioBERT freely available at https://github.com/naver/biobert-pretrained, and the source code for fine-tuning BioBERT available at https://github.com/dmis-lab/biobert.
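
As a usage sketch of the released weights, the snippet below fine-tunes the model for biomedical named entity recognition cast as token classification. The Hugging Face hub id (assumed to be a converted copy of the released weights), the three-tag label set, and the toy input are assumptions for illustration, not the repository's own fine-tuning scripts.

```python
# Sketch of fine-tuning BioBERT for biomedical NER as token
# classification; the hub id, label set, and example are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_id = "dmis-lab/biobert-base-cased-v1.1"  # assumption: HF conversion
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(
    model_id,
    num_labels=3,  # hypothetical BIO tags: B-Disease, I-Disease, O
)

# One training step on a toy sentence; a real run iterates over an
# annotated corpus (e.g. the NCBI disease corpus used in the paper).
batch = tokenizer("BRCA1 mutations increase the risk of breast cancer.",
                  return_tensors="pt")
labels = torch.zeros_like(batch["input_ids"])  # dummy all-"O" labels
outputs = model(**batch, labels=labels)
outputs.loss.backward()  # gradients for an optimizer step (e.g. AdamW)
```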

                Author and article information

                Contributors
                Role: Associate Editor
Journal: Bioinformatics
Publisher: Oxford University Press
ISSN: 1367-4803 (print); 1367-4811 (electronic)
Issue date: 15 February 2020
Published online: 10 September 2019
Volume: 36
Issue: 4
Pages: 1234-1240
                Affiliations
[1] Department of Computer Science and Engineering, Korea University, Seoul 02841, Korea
[2] Clova AI Research, Naver Corp., Seong-Nam 13561, Korea
[3] Interdisciplinary Graduate Program in Bioinformatics, Korea University, Seoul 02841, Korea
                Author notes
To whom correspondence should be addressed: kangj@korea.ac.kr

                Jinhyuk Lee and Wonjin Yoon wish it to be known that the first two authors contributed equally.

                Author information
                http://orcid.org/0000-0003-4972-239X
                http://orcid.org/0000-0002-6435-548X
                http://orcid.org/0000-0002-0240-6210
                http://orcid.org/0000-0002-8224-8354
                http://orcid.org/0000-0001-7633-1074
                http://orcid.org/0000-0001-6798-9106
Article
Article ID: btz682
DOI: 10.1093/bioinformatics/btz682
PMCID: PMC7703786
PMID: 31501885
                © The Author(s) 2019. Published by Oxford University Press.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

                History
Received: 16 May 2019
Revision received: 29 July 2019
Accepted: 5 September 2019
                Page count
                Pages: 7
                Funding
Funded by: National Research Foundation of Korea (NRF), funded by the Korea government
                Award ID: NRF-2017R1A2A1A17069645
                Award ID: NRF-2017M3C4A7065887
                Award ID: NRF-2014M3C9A3063541
                Categories
                Original Papers
                Data and Text Mining

Bioinformatics & Computational biology
