A neural network multi-task learning approach to biomedical named entity recognition


Abstract

Background

Named Entity Recognition (NER) is a key task in biomedical text mining. Accurate NER systems require task-specific, manually annotated datasets, which are expensive to develop and thus limited in size. Since such datasets contain related but different information, an interesting question is whether they can be used together to improve NER performance. To investigate this, we develop supervised, multi-task, convolutional neural network models and apply them to a large number of varied existing biomedical named entity datasets. Additionally, we investigate the effect of dataset size on performance in both single- and multi-task settings.

Results

We present a single-task model for NER, a Multi-output multi-task model and a Dependent multi-task model. We apply the three models to 15 biomedical datasets containing multiple named entities including Anatomy, Chemical, Disease, Gene/Protein and Species. Each dataset represents a task. The results from the single-task model and the multi-task models are then compared for evidence of benefits from Multi-task Learning.

With the Multi-output multi-task model we observed an average F-score improvement of 0.8% over the single-task model, from an average baseline of 78.4%. Although there was a significant drop in performance on one dataset, performance improved significantly for five datasets by up to 6.3%. For the Dependent multi-task model we observed an average improvement of 0.4% over the single-task model. There were no significant drops in performance on any dataset, and performance improved significantly for six datasets by up to 1.1%.
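For readers unfamiliar with the metric, the F-score reported above is the harmonic mean of precision and recall. The following is a minimal sketch, assuming exact-match, entity-level evaluation over (start, end, type) tuples; the function name `f_score` and the example mentions are hypothetical and are not taken from the article, whose exact evaluation procedure is not described on this page.

```python
def f_score(gold, predicted):
    """Return (precision, recall, F1) for two collections of entity mentions.

    A predicted mention counts as correct only if it matches a gold
    mention exactly (an assumption made for this sketch).
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical mentions as (start_token, end_token, entity_type) tuples.
gold = {(0, 2, "Disease"), (5, 6, "Chemical"), (9, 10, "Species")}
predicted = {
    (0, 2, "Disease"),
    (5, 6, "Gene/Protein"),  # right span, wrong type: counted as an error
    (9, 10, "Species"),
    (12, 13, "Chemical"),    # spurious mention
}

precision, recall, f1 = f_score(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

In this toy case two of four predictions are correct and two of three gold mentions are found, so precision is 1/2, recall is 2/3 and F1 is 4/7.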

The dataset size experiments found that as dataset size decreased, the multi-output model’s performance improved relative to the single-task model’s. Using 50, 25 and 10% of the training data resulted in an average drop of approximately 3.4, 8 and 16.7%, respectively, for the single-task model but approximately 0.2, 3.0 and 9.8% for the multi-task model.
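The widening gap can be read directly off the reported figures. The short sketch below only subtracts the average drops quoted in the abstract; the variable names are illustrative, and the units are whatever the abstract's percentages denote.

```python
# Average performance drops quoted in the abstract, by share of training data used.
training_share = [50, 25, 10]          # percent of the training data
single_task_drop = [3.4, 8.0, 16.7]    # single-task model
multi_task_drop = [0.2, 3.0, 9.8]      # multi-task model

# How much smaller the multi-task drop is at each training-set size.
gaps = [round(s - m, 1) for s, m in zip(single_task_drop, multi_task_drop)]

for share, gap in zip(training_share, gaps):
    print(f"{share}% of training data: multi-task drop is smaller by {gap}")
```

The difference grows from 3.2 at 50% of the data to 5.0 at 25% and 6.9 at 10%, which is the trend the paragraph above describes.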

Conclusions

Our results show that, on average, the multi-task models produced better NER results than the single-task models trained on a single NER dataset. We also found that Multi-task Learning is beneficial for small datasets. Across the various settings the improvements are significant, demonstrating the benefit of Multi-task Learning for this task.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-017-1776-8) contains supplementary material, which is available to authorized users.


Author and article information

Contributors

Gamal Crichton:

ORCID: http://orcid.org/0000-0002-3036-0811

gkoc2@cam.ac.uk

Sampo Pyysalo: sampo@pyysalo.net

Billy Chiu: hwc25@cam.ac.uk

Anna Korhonen: alk23@cam.ac.uk

Journal

Journal ID (nlm-ta): BMC Bioinformatics

Journal ID (iso-abbrev): BMC Bioinformatics

Title: BMC Bioinformatics

Publisher: BioMed Central (London)

ISSN (Electronic): 1471-2105

Publication date (Electronic): 15 August 2017

Publication date PMC-release: 15 August 2017

Publication date Collection: 2017

Volume: 18

Electronic Location Identifier: 368

Affiliations

ISNI 0000000121885934, GRID grid.5335.0, Language Technology Laboratory, DTAL, University of Cambridge, 9 West Road, Cambridge, CB3 9DB, UK

Author information

Gamal Crichton http://orcid.org/0000-0002-3036-0811

Article

Publisher ID: 1776

DOI: 10.1186/s12859-017-1776-8

PMC ID: 5558737

PubMed ID: 28810903

SO-VID: b1c6f077-ba8c-4466-8919-a00ec9b6b763

License:

Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

History

Date received: 21 February 2017

Date accepted: 31 July 2017

Funding

Funded by: FundRef http://dx.doi.org/10.13039/501100000265, Medical Research Council

Award ID: MR/M013049/1

Award Recipient: Billy Chiu

Funded by: FundRef http://dx.doi.org/10.13039/501100003343, Cambridge Commonwealth, European and International Trust

Custom metadata

ScienceOpen disciplines: Bioinformatics & Computational biology

Keywords: multi-task learning, convolutional neural networks, named entity recognition, biomedical text mining

