
      Toward Using Twitter for Tracking COVID-19: A Natural Language Processing Pipeline and Exploratory Data Set


          Abstract

          Background

          In the United States, the rapidly evolving COVID-19 outbreak, the shortage of available testing, and the delay of test results present challenges for actively monitoring its spread based on testing alone.

          Objective

          The objective of this study was to develop, evaluate, and deploy an automatic natural language processing pipeline to collect user-generated Twitter data as a complementary resource for identifying potential cases of COVID-19 in the United States that are not based on testing and, thus, may not have been reported to the Centers for Disease Control and Prevention.

          Methods

          Beginning January 23, 2020, we collected English tweets from the Twitter Streaming application programming interface that mention keywords related to COVID-19. We applied handwritten regular expressions to identify tweets indicating that the user potentially has been exposed to COVID-19. We automatically filtered out “reported speech” (eg, quotations, news headlines) from the tweets that matched the regular expressions, and two annotators annotated a random sample of 8976 tweets that are geo-tagged or have profile location metadata, distinguishing tweets that self-report potential cases of COVID-19 from those that do not. We used the annotated tweets to train and evaluate deep neural network classifiers based on bidirectional encoder representations from transformers (BERT). Finally, we deployed the automatic pipeline on more than 85 million unlabeled tweets that were continuously collected between March 1 and August 21, 2020.
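
The regular-expression matching and reported-speech filtering steps described above could be sketched roughly as follows. This is a minimal illustration, not the study's implementation: the patterns in `EXPOSURE_PATTERNS`, the `REPORTED_SPEECH` heuristic, and all function names are hypothetical, since the handwritten regular expressions are not reproduced in the abstract.

```python
import re

# Hypothetical patterns for illustration only; the study's handwritten
# regular expressions are not given in the abstract.
EXPOSURE_PATTERNS = [
    re.compile(
        r"\bi\s+(?:think\s+i\s+)?(?:have|had|got|caught)\s+(?:the\s+)?(?:covid|coronavirus)",
        re.IGNORECASE,
    ),
    re.compile(
        r"\bmy\s+(?:mom|dad|wife|husband|roommate)\s+(?:has|tested\s+positive)",
        re.IGNORECASE,
    ),
]

# Assumed, crude heuristic for "reported speech": retweets, quotation
# marks, and links (which often accompany news headlines).
REPORTED_SPEECH = re.compile(r"^RT\b|[\"\u201c\u201d]|https?://")


def matches_exposure(text):
    """Return True if any exposure pattern matches the tweet text."""
    return any(p.search(text) for p in EXPOSURE_PATTERNS)


def is_reported_speech(text):
    """Return True if the tweet looks like a quotation, retweet, or headline."""
    return bool(REPORTED_SPEECH.search(text))


def candidate_tweets(tweets):
    """Keep tweets that match an exposure pattern and are not reported speech."""
    return [t for t in tweets if matches_exposure(t) and not is_reported_speech(t)]


if __name__ == "__main__":
    sample = [
        "I think I have covid, been coughing all week",
        "RT @news: I have covid, says senator",
        "Stay safe everyone",
    ]
    print(candidate_tweets(sample))
```

In the study, tweets surviving this kind of filter were then annotated and used to train a classifier; the sketch covers only the rule-based stage.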

          Results

          Interannotator agreement, based on dual annotations for 3644 (41%) of the 8976 tweets, was 0.77 (Cohen κ). A deep neural network classifier, based on a BERT model that was pretrained on tweets related to COVID-19, achieved an F1-score of 0.76 (precision=0.76, recall=0.76) for detecting tweets that self-report potential cases of COVID-19. Upon deploying our automatic pipeline, we identified 13,714 tweets that self-report potential cases of COVID-19 and have US state–level geolocations.
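
For readers less familiar with the two statistics reported above, the following sketch shows how Cohen's kappa and the F1-score are conventionally computed. The function names and the toy label lists are illustrative assumptions; the study's annotation data are not reproduced here, so only the reported precision and recall are plugged in.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' labels on the same items."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label lists must be non-empty and of equal length")
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Reported precision and recall from the abstract.
    print(round(f1_score(0.76, 0.76), 2))
    # Share of tweets that were dually annotated, as a percentage.
    print(round(3644 / 8976 * 100))
    # Toy labels (hypothetical), not the study's annotations.
    print(cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0]))
```

Because precision and recall are equal, their harmonic mean is the same value, which is consistent with the reported F1-score of 0.76.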

          Conclusions

          We have made the 13,714 tweets identified in this study, along with each tweet’s time stamp and US state–level geolocation, publicly available to download. This data set presents the opportunity for future work to assess the utility of Twitter data as a complementary resource for tracking the spread of COVID-19.


                Author and article information

                Contributors
                Journal
                J Med Internet Res
                JMIR
                Journal of Medical Internet Research
                JMIR Publications (Toronto, Canada)
                ISSN: 1439-4456, 1438-8871
                January 2021
                22 January 2021
                Volume: 23
                Issue: 1
                Article: e25314
                Affiliations
                [1] Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
                Author notes
                Corresponding Author: Ari Z Klein, ariklein@pennmedicine.upenn.edu
                Author information
                https://orcid.org/0000-0002-8281-3464
                https://orcid.org/0000-0002-4109-1346
                https://orcid.org/0000-0001-7709-3813
                https://orcid.org/0000-0002-1912-0112
                https://orcid.org/0000-0001-8331-3675
                https://orcid.org/0000-0002-6416-9556
                Article
                v23i1e25314
                10.2196/25314
                7834613
                33449904
                139453d8-b4b3-4779-8a40-f6bc30c88f74
                ©Ari Z Klein, Arjun Magge, Karen O'Connor, Jesus Ivan Flores Amaro, Davy Weissenbacher, Graciela Gonzalez Hernandez. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 22.01.2021.

                This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.

                History
                : 27 October 2020
                : 5 December 2020
                : 14 December 2020
                : 14 December 2020
                Categories
                Original Paper

                Medicine
                natural language processing, social media, data mining, COVID-19, coronavirus, pandemics, epidemiology, infodemiology
