Dear Editor,
We read with great interest the article by Cocos et al.[1] In it, the authors use one of the datasets made public by our lab in parallel with a publication in the Journal of the American Medical Informatics Association,[2] referred to by them as the Twitter ADR Dataset (v1.0) (henceforth the ADRMine Dataset).
Cocos et al use state-of-the-art recurrent neural network (RNN) models for extracting adverse drug reaction (ADR) mentions in Twitter posts. We commend the authors for their clear description of the workings of neural models and for their experiments on the use of fixed versus trainable embeddings, which can be very valuable to the natural language processing (NLP) research community. We believe that deep learning models offer greater opportunities for mining ADR posts on social media.
However, there are key choices made by the authors that require clarification to avoid a misunderstanding of the impact of their findings. In a nutshell, because the authors did not use the ADRMine Dataset in its entirety, discarding upfront all tweets with no human annotations (ie, those that do not contain any ADRs), the resulting training and test sets are biased toward the positive class. Thus, the performance measures reported for the task in Cocos et al are not comparable to those reported in Nikfarjam et al,[2] contrary to what the manuscript claims.
After discarding tweets with no human annotation from the ADRMine Dataset, the authors downloaded the available tweets from Twitter and added a small set (203 tweets) to form the dataset used for their experiments. While downloading from Twitter results in an almost unavoidable reduction in dataset size, as not all tweets remain available as time goes by, it would not generally affect the class balance. The elimination of the tweets with no human annotations from the ADRMine Dataset, however, is a choice that is not discussed by Cocos et al, even though it severely skews the positive-to-negative class balance of the dataset, leaving it at the 95-to-5 split that they report, and, as our experiments show, has a significant impact on the reported performance. Our comparisons of ADRMine with the system proposed by Cocos et al reveal that, when the two systems are employed on the dataset with its original balance, ADRMine[2] performs significantly better than their proposed approach (last two rows of Table 1). Thus, the claim in the Results and Conclusion sections of Cocos et al that their model “represents new state-of-the-art performance” and that “RNN models … establish new state-of-the-art performance by achieving statistically significant superior F-measure performance compared to the CRF-based model” is premature. We expand on these points next.
Table 1. Performance comparison of named entity recognizers (NERs) under different training and testing modes

Mode | Dataset size | Precision | Recall | F1-score
Cocos et al on MostlyPos dataset as published | 844 tweets | 0.70 (0.66-0.74) | 0.82 (0.76-0.89) | 0.75 (0.74-0.76)
October 2018: train MostlyPos and test MostlyPos | 526 tweets | 0.76 (0.70-0.82) | 0.72 (0.63-0.81) | 0.73 (0.70-0.76)
October 2018: train MostlyPos and test Standard | 644 tweets | 0.60 (0.54-0.65) | 0.70 (0.62-0.77) | 0.63 (0.60-0.66)
October 2018: train Standard and test Standard | 1012 tweets | 0.73 (0.66-0.79) | 0.60 (0.52-0.68) | 0.64 (0.62-0.66)
Cocos et al on ADRMine Dataset | 1784 tweets | 0.68 (0.62-0.73) | 0.69 (0.62-0.75) | 0.67 (0.66-0.69)
ADRMine on ADRMine Dataset as published[2] | 1784 tweets | 0.76 | 0.68 | 0.72

Values are mean (95% confidence interval), achieved by each model over 10 training and evaluation rounds. MostlyPos refers to the dataset as used by Cocos et al (ie, removing tweets without span annotations), hence leaving mostly positive tweets. Standard refers to the dataset with a roughly 50-50 balance of positive to negative tweets, as in Nikfarjam et al,[2] ie, the original balance of the ADRMine Dataset.
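As a point of reference for readers wishing to reproduce the aggregation used in Table 1, the following is a minimal sketch of how a mean F1-score and a 95% confidence interval can be computed over 10 training and evaluation rounds. It assumes a normal approximation for the interval and uses placeholder scores, not our actual per-round results.

```python
# Minimal sketch: aggregate per-round F1 scores into mean (95% CI), as in Table 1.
# The interval uses a normal approximation (mean +/- 1.96 * standard error);
# the per-round scores below are placeholders, not actual results.
import math
import statistics

def mean_and_ci95(scores):
    m = statistics.mean(scores)
    se = statistics.stdev(scores) / math.sqrt(len(scores))
    return m, m - 1.96 * se, m + 1.96 * se

f1_rounds = [0.72, 0.74, 0.71, 0.75, 0.73, 0.72, 0.76, 0.74, 0.73, 0.75]  # hypothetical
mean_f1, lo, hi = mean_and_ci95(f1_rounds)
print(f"F1 = {mean_f1:.2f} ({lo:.2f}-{hi:.2f})")
```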
To give some context on the ADRMine Dataset: it contains a set of tweets collected using medication names as keywords. Retweets were removed, and tweets containing a URL were omitted, given that our analysis showed they were mostly advertisements. To balance the data in a way that reflected what was automatically achievable at the time, a binary classifier with precision of around 0.4-0.5 was assumed. Thus, negative (non-ADR) instances were kept at around 50%, down from the approximately 89% non-ADR tweets that occur naturally when collecting on medication names as keywords,[2] a balance one would expect for this task when state-of-the-art automatic classification methods are applied before attempting extraction. It is thus a realistic, justified balance.
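For illustration only, the sketch below captures the filtering and balancing procedure described above; the field names and the helper function are simplified stand-ins for the actual collection pipeline, not its code.

```python
# Illustrative sketch (not the original collection code) of the filtering and
# balancing described above: drop retweets and URL-bearing tweets, then keep
# negative (non-ADR) tweets at roughly 50%, down from the ~89% that occur
# naturally when collecting on medication names. Field names are assumptions.
import random

def build_balanced_set(tweets, neg_ratio=0.5, seed=0):
    kept = [t for t in tweets if not t["is_retweet"] and "http" not in t["text"]]
    pos = [t for t in kept if t["has_adr"]]        # tweets with annotated ADRs
    neg = [t for t in kept if not t["has_adr"]]    # non-ADR tweets
    random.seed(seed)
    n_neg = min(len(neg), int(len(pos) * neg_ratio / (1 - neg_ratio)))
    return pos + random.sample(neg, n_neg)
```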
Regarding the Cocos et al approach, although controlled experiments training with different ratios of class examples are not unusual in machine learning, results for the different positive-to-negative ratios are usually reported and noted upfront. Cocos et al use a 95-to-5 positive-to-negative split and report performance only on this altered dataset, making no mention of the alteration or class imbalance in the abstract. The abstract summarizes their results as follows: “Our best-performing RNN model … achieved an approximate match F-measure of 0.755 for ADR identification on the dataset, compared to 0.631 for a baseline lexicon system and 0.65 for the state-of-the-art conditional random fields model.” Although further in the manuscript Cocos et al refer to having implemented a conditional random fields (CRF) model “as described for previous state-of-the-art results,” citing Nikfarjam et al,[2] the statement in the abstract could be misconstrued as a direct comparison with Nikfarjam et al, which is the state-of-the-art CRF model. In reality, the results are not comparable, given the changes to the dataset. Their implementation of a CRF model must have been significantly different from ADRMine as described in Nikfarjam et al,[2] given that the performance reported in Cocos et al for a CRF model (0.65) is much lower than what both systems achieve on the unaltered ADRMine Dataset, as our experiments show (last two rows of Table 1).
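For readers unfamiliar with the metric quoted above, the sketch below shows one common way approximate-match precision, recall, and F-measure are computed for extracted spans (a prediction is credited if it overlaps a gold span). This illustrates the general idea only; it is neither Cocos et al's nor our evaluation code.

```python
# Sketch of approximate-match scoring for extracted ADR spans: a predicted
# span counts if it overlaps any gold span (and symmetrically for recall).
# Spans are (start, end) character offsets; an assumed representation.
def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]

def approximate_match_prf(gold, pred):
    tp_pred = sum(1 for p in pred if any(overlaps(p, g) for g in gold))
    tp_gold = sum(1 for g in gold if any(overlaps(g, p) for p in pred))
    precision = tp_pred / len(pred) if pred else 0.0
    recall = tp_gold / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```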
Please note that Cocos et al did not make their CRF model implementation available, so any differences from the ADRMine model could not be verified directly, only inferred from the reported results. The binaries of ADRMine were available at the time of publication, and we have since made the full code available to facilitate reproducibility.[a]
In machine learning research, authors decide how the model is trained and how the data are algorithmically filtered before training, apply accepted practices for balancing the data, or include additional weakly supervised examples.[3] However, such methods are applied to the training data only, leaving the evaluation data intact in order to be able to compare approaches. By excluding from their training the tweets that are negative for the presence of ADRs and other entities, the authors built a model that is biased toward the positive class. This might not be immediately obvious in Cocos et al, as the model is evaluated against a similarly biased test set. However, when the model is run against the balanced test set, the problem becomes evident. The authors do note this, stating that “including a significant number of posts without ADRs in the training data produced relatively poor results for both the RNN and baseline models,” but they neither reported these results nor altered their experimental approach to make this more evident.
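To make the accepted practice concrete, the minimal sketch below applies any rebalancing to the training split only and leaves the evaluation split untouched; balance_fn is a hypothetical placeholder for whichever balancing or weak-supervision step an author chooses.

```python
# Minimal sketch of the practice described above: filtering or rebalancing is
# applied to the training data only, while the evaluation data are left exactly
# as published so that results remain comparable across systems.
def prepare_splits(train_tweets, test_tweets, balance_fn):
    train = balance_fn(train_tweets)   # eg, downsample negatives or add weak labels
    test = list(test_tweets)           # evaluation set kept intact
    return train, test
```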
To illustrate the impact of the dataset modifications on the overall results, we ran the training and evaluation experiments on the ADRMine Dataset for the tweets available as of October 2018, using the authors’ publicly available implementation,[b] and summarize the results in Table 1. Under the same settings as Cocos et al (eliminating virtually all tweets in the negative class), the performance they report (row 1) and our replication (row 2) can be considered a match, with a slight drop that could be attributed to fewer tweets being available as of October 2018 compared with when they ran their experiments. However, evaluating the Cocos et al model on the balanced test set (row 3) shows a drop of 10 percentage points compared with evaluating against the mostly positive set (row 2). Training on all available positive and negative tweets from the October 2018 set (row 4) leads to an improved model but still shows significantly lower performance (0.64) than when the same model is trained and tested on the biased set (0.73 in row 2). Additionally, to enable a direct comparison, we trained and tested the Cocos et al system as provided by them (except for the download script) on the original, balanced ADRMine Dataset containing 1784 tweets. We found a mean performance of 0.67 over 10 runs (row 5), 5 points lower than the 0.72 F1-score reported in Nikfarjam et al[2] on the same dataset (row 6).
Furthermore, referring to the ADRMine Dataset,[2] Cocos et al report, “Of the 957 identifiers in the original dataset…,” which is incorrect. The original dataset, publicly available and unchanged since its first publication in 2015, contains a total of 1784 tweets (1340 in the training set and 444 in the evaluation, or test, set). As of October 2018, 1012 of the 1784 original tweets were still available on Twitter (including 267 of the 444 original evaluation tweets). Cocos et al do not mention the additional 827 tweets that were in the ADRMine Dataset, even though many of them were still available at the time of their publication.
They used only 149 tweets from the 444 in the evaluation set. From our analysis, the 957 mentioned in Cocos et al corresponds to the number of tweets in the ADRMine Dataset that are manually annotated for the presence of ADRs and other entities, such as indications, drugs, and other (miscellaneous) entities. The rest (827 tweets), which mention medications but contain no annotated entities, are discarded upfront, as can be observed by running Cocos et al’s code, specifically the download_tweets.py script. Although the Cocos et al code points researchers to the original site to download the ADRMine Dataset, once they run the said script on that data, they lose all the unannotated negative tweets. The authors do not discuss the rationale for modifying the dataset in this manner. Since Cocos et al was published, subsequent papers have also used the 95-to-5 positive-to-negative split, presumably because they reuse the Python script.[4–7] We have made available with this letter a modification to the download_tweets.py script that keeps the previously discarded tweets.[c]
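The essence of the modification is sketched below: build the list of tweet IDs to download from the full dataset files, whether or not a tweet carries span annotations, rather than restricting the download to annotated tweets. The file names and tab-separated layout shown here are simplified assumptions; the script accompanying this letter should be consulted for the actual details.

```python
# Sketch of the key change in the modified download script: collect tweet IDs
# from the full dataset ID file, not only from the annotation file, so that
# unannotated (negative) tweets are retained. File names and the tab-separated
# layout are simplified assumptions, not the exact released formats.
def load_all_ids(id_file):
    with open(id_file) as f:
        return [line.split("\t")[0].strip() for line in f if line.strip()]

def load_annotated_ids(annotation_file):
    with open(annotation_file) as f:
        return {line.split("\t")[0].strip() for line in f if line.strip()}

all_ids = load_all_ids("train_tweet_ids.tsv")                     # every tweet in the dataset
annotated_ids = load_annotated_ids("train_tweet_annotations.tsv") # subset with span annotations
unannotated = [i for i in all_ids if i not in annotated_ids]
print(f"{len(unannotated)} unannotated (negative) tweets retained by the modified script")
ids_to_download = all_ids  # key change: download everything, not just the annotated subset
```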
In conclusion, the performance reported for the RNN model in Cocos et al is not comparable to any prior published approach, and, in effect, when the model is trained and tested with the full dataset, its performance (0.64) is significantly lower than the state of the art for the task (0.72).[2] ADR mentions are very rare events on social media, as has become evident through shared tasks on ADR detection. Even after three years, the best classifier reaches only a precision of 0.44 and a recall of 0.63, for an F-measure of 0.52.[8] The upfront stripping of negative examples, whereby 95% of the dataset contains at least 1 ADR or indication mention, as done in Cocos et al, results in an extremely biased dataset, which in turn results in a model biased toward the positive class that does not reflect any realistic deployment of a solution to the original problem.
FUNDING
This work was supported by National Institutes of Health National Library of Medicine
grant number 5R01LM011176. The content is solely the responsibility of the authors
and does not necessarily represent the official views of the National Library of Medicine
or National Institutes of Health.
AUTHOR CONTRIBUTORS
AM first noted the data use problem, ran the experiments and wrote the initial draft
of the manuscript. AS and AN contributed to some sections and made edits to the manuscript.
GG designed the experiments and wrote the final version of the manuscript.
Conflict of interest statement: None declared.