      BioBERT: a pre-trained biomedical language representation model for biomedical text mining

Research article

          Abstract

          Motivation

          Biomedical text mining is becoming increasingly important as the number of biomedical documents rapidly grows. With the progress in natural language processing (NLP), extracting valuable information from biomedical literature has gained popularity among researchers, and deep learning has boosted the development of effective biomedical text mining models. However, directly applying the advancements in NLP to biomedical text mining often yields unsatisfactory results due to a word distribution shift from general domain corpora to biomedical corpora. In this article, we investigate how the recently introduced pre-trained language model BERT can be adapted for biomedical corpora.

          Results

          We introduce BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining), which is a domain-specific language representation model pre-trained on large-scale biomedical corpora. With almost the same architecture across tasks, BioBERT largely outperforms BERT and previous state-of-the-art models in a variety of biomedical text mining tasks when pre-trained on biomedical corpora. While BERT obtains performance comparable to that of previous state-of-the-art models, BioBERT significantly outperforms them on the following three representative biomedical text mining tasks: biomedical named entity recognition (0.62% F1 score improvement), biomedical relation extraction (2.80% F1 score improvement) and biomedical question answering (12.24% MRR improvement). Our analysis results show that pre-training BERT on biomedical corpora helps it to understand complex biomedical texts.
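
The recipe summarized above has two stages: continue the masked-language-model pre-training of general-domain BERT on biomedical text, then fine-tune the resulting encoder on each downstream task. The snippet below is a minimal sketch of the first stage using the Hugging Face transformers and datasets libraries rather than the authors' original TensorFlow pipeline; the corpus file name, hyperparameters, and output directory are illustrative assumptions.

```python
# Sketch of BioBERT-style domain-adaptive pre-training: initialize from
# general-domain BERT and continue masked-language-model training on
# biomedical text (the paper uses PubMed abstracts and PMC full texts).
# The file name and hyperparameters below are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

# "pubmed_abstracts.txt" (hypothetical path): one abstract per line.
corpus = load_dataset("text", data_files="pubmed_abstracts.txt")["train"]
corpus = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="biobert-sketch", num_train_epochs=1),
    train_dataset=corpus,
    # Randomly masks 15% of tokens, matching BERT's MLM objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer,
                                                  mlm_probability=0.15),
)
trainer.train()
```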

          Availability and implementation

          We make the pre-trained weights of BioBERT freely available at https://github.com/naver/biobert-pretrained, and the source code for fine-tuning BioBERT available at https://github.com/dmis-lab/biobert.
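
As a usage sketch of the released weights, the snippet below fine-tunes the model for biomedical named entity recognition cast as token classification. The Hugging Face hub id (assumed to be a converted copy of the released weights), the three-tag label set, and the toy input are assumptions for illustration, not the repository's own fine-tuning scripts.

```python
# Sketch of fine-tuning BioBERT for biomedical NER as token
# classification; the hub id, label set, and example are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_id = "dmis-lab/biobert-base-cased-v1.1"  # assumption: HF conversion
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(
    model_id,
    num_labels=3,  # hypothetical BIO tags: B-Disease, I-Disease, O
)

# One training step on a toy sentence; a real run iterates over an
# annotated corpus (e.g. the NCBI disease corpus used in the paper).
batch = tokenizer("BRCA1 mutations increase the risk of breast cancer.",
                  return_tensors="pt")
labels = torch.zeros_like(batch["input_ids"])  # dummy all-"O" labels
outputs = model(**batch, labels=labels)
outputs.loss.backward()  # gradients for an optimizer step (e.g. AdamW)
```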

                Author and article information

                Contributors
                Role: Associate Editor
Journal: Bioinformatics
Publisher: Oxford University Press
ISSN: 1367-4803 (print); 1367-4811 (electronic)
Issue date: 15 February 2020
Published online: 10 September 2019
Volume: 36
Issue: 4
Pages: 1234-1240
                Affiliations
[1] Department of Computer Science and Engineering, Korea University, Seoul 02841, Korea
[2] Clova AI Research, Naver Corp., Seong-Nam 13561, Korea
[3] Interdisciplinary Graduate Program in Bioinformatics, Korea University, Seoul 02841, Korea
                Author notes
To whom correspondence should be addressed: kangj@korea.ac.kr

                Jinhyuk Lee and Wonjin Yoon wish it to be known that the first two authors contributed equally.

                Author information
                http://orcid.org/0000-0003-4972-239X
                http://orcid.org/0000-0002-6435-548X
                http://orcid.org/0000-0002-0240-6210
                http://orcid.org/0000-0002-8224-8354
                http://orcid.org/0000-0001-7633-1074
                http://orcid.org/0000-0001-6798-9106
Article
Article ID: btz682
DOI: 10.1093/bioinformatics/btz682
PMCID: PMC7703786
PMID: 31501885
                © The Author(s) 2019. Published by Oxford University Press.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

                History
Received: 16 May 2019
Revision received: 29 July 2019
Accepted: 5 September 2019
                Page count
                Pages: 7
                Funding
Funded by: National Research Foundation of Korea (NRF), funded by the Korea government
                Award ID: NRF-2017R1A2A1A17069645
                Award ID: NRF-2017M3C4A7065887
                Award ID: NRF-2014M3C9A3063541
                Categories
                Original Papers
                Data and Text Mining

Bioinformatics & Computational biology
