A Comparison of Word Embeddings for the Biomedical Natural Language Processing

Abstract

Neural word embeddings have been widely used in biomedical Natural Language Processing (NLP) applications because they provide vector representations of words that capture the semantic properties of words and the linguistic relationships between them. Many biomedical applications train word embeddings on different textual sources and apply them to downstream biomedical tasks. However, there has been little work on comprehensively evaluating the word embeddings trained from these resources. In this study, we provide a comprehensive empirical evaluation of word embeddings trained from four different resources, namely clinical notes, biomedical publications, Wikipedia, and news. We perform the evaluation qualitatively and quantitatively. In the qualitative evaluation, we manually inspect the five most similar medical words to each of a given set of target medical words, and then analyze the word embeddings by visualizing them. The quantitative evaluation falls into two categories: extrinsic and intrinsic evaluation. Based on the evaluation results, we can draw the following conclusions. First, the word embeddings trained on EHR and PubMed capture the semantics of medical terms better than those from GloVe and Google News, and find more relevant similar medical terms. Second, the medical semantic similarity captured by the word embeddings trained on EHR and PubMed is closer to human experts' judgments than that captured by those from GloVe and Google News. Third, there does not exist a consistent global ranking of word embedding quality for downstream biomedical NLP applications. However, adding word embeddings as extra features will improve results on most downstream tasks. Finally, word embeddings trained from a similar domain corpus do not necessarily perform better than other word embeddings for any downstream biomedical task.
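The two embedding-level checks described in the abstract (listing the most similar words for a target term, and comparing embedding similarity against human judgments) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the three-dimensional vectors, the word list, and the human ratings are invented toy values, and cosine similarity and Pearson correlation are common choices that the abstract itself does not name.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def most_similar(target, embeddings, k=5):
    """Return the k (word, score) pairs closest to `target` by cosine similarity."""
    scores = [
        (word, cosine(embeddings[target], vec))
        for word, vec in embeddings.items()
        if word != target
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of scores."""
    return float(np.corrcoef(xs, ys)[0, 1])


# Hypothetical toy embeddings; real embeddings have hundreds of dimensions.
toy = {
    "diabetes": np.array([0.9, 0.1, 0.0]),
    "insulin": np.array([0.8, 0.2, 0.1]),
    "glucose": np.array([0.7, 0.3, 0.0]),
    "football": np.array([0.0, 0.1, 0.9]),
}

# Qualitative check: inspect the nearest neighbours of a target medical word.
neighbours = most_similar("diabetes", toy, k=2)

# Intrinsic check: correlate embedding similarity with (invented) human ratings.
pairs = [("diabetes", "insulin"), ("diabetes", "glucose"), ("diabetes", "football")]
human = [3.8, 3.5, 1.0]  # hypothetical expert similarity scores
model = [cosine(toy[a], toy[b]) for a, b in pairs]
r = pearson(human, model)

print(neighbours)
print(r)
```

A higher correlation would indicate that the embedding's notion of similarity tracks the expert ratings more closely, which is the sense in which the abstract compares embeddings from different corpora.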

Most cited references 13

Clinical information extraction applications: A literature review

Sunghwan Sohn, Hongfang Liu, Yanshan Wang … (2018)

With the rapid adoption of electronic health records (EHRs), it is desirable to harvest information and knowledge from EHRs to support automated systems at the point of care and to enable secondary use of EHRs for clinical and translational research. One critical component used to facilitate the secondary use of EHR data is the information extraction (IE) task, which automatically extracts and encodes clinical information from text.

Placing search in context

Lev Finkelstein, Eytan Ruppin, Gadi Wolfman … (2001)

Measures of semantic similarity and relatedness in the biomedical domain.

S V Pakhomov, Ted Pedersen, Siddharth V. Patwardhan … (2007)

Measures of semantic similarity between concepts are widely used in Natural Language Processing. In this article, we show how six existing domain-independent measures can be adapted to the biomedical domain. These measures were originally based on WordNet, an English lexical database of concepts and relations. In this research, we adapt these measures to the SNOMED-CT ontology of medical concepts. The measures include two path-based measures and three measures that augment path-based measures with information content statistics from corpora. We also derive a context vector measure based on medical corpora that can be used as a measure of semantic relatedness. These six measures are evaluated against a newly created test bed of 30 medical concept pairs scored by three physicians and nine medical coders. We find that the medical coders and physicians differ in their ratings, and that the context vector measure correlates most closely with the physicians, while the path-based measures and one of the information content measures correlate most closely with the medical coders. We conclude that there is a role both for more flexible measures of relatedness based on information derived from corpora and for measures that rely on existing ontological structures.

Author and article information

Publication date (created): 01 February 2018

arXiv ID: 1802.00400

SO-VID: ee91c0bc-d87b-4fca-b9a3-512e1de49cf1

License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/

Categories: cs.IR
