
      Forecasting influenza epidemics by integrating internet search queries and traditional surveillance data with the support vector machine regression model in Liaoning, from 2011 to 2015

      research-article


          Abstract

          Background

          Influenza epidemics pose significant social and economic challenges in China. Internet search query data have been identified as a valuable source for detecting emerging influenza epidemics. However, selecting the search queries and choosing the prediction method remain crucial challenges for improving predictions. The purpose of this study was to explore the application of the Support Vector Machine (SVM) regression model in merging search engine query data with traditional influenza surveillance data.

          Methods

          The official monthly reported numbers of influenza cases in Liaoning province, China, from January 2011 to December 2015 were acquired from the China National Scientific Data Center for Public Health. Search queries potentially related to influenza over the corresponding period were identified from the Baidu Index, a publicly available search engine database. An SVM regression model was built for prediction, with its three parameters (C, γ, ε) chosen by leave-one-out cross-validation (LOOCV) during model construction. Model performance was evaluated with the Root Mean Square Error (RMSE), Root Mean Square Percentage Error (RMSPE) and Mean Absolute Percentage Error (MAPE).
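As a rough illustration of the workflow described above (not the authors' code), the following Python sketch tunes an RBF-kernel SVM regression over candidate (C, γ, ε) values with leave-one-out cross-validation and defines the three evaluation metrics; the candidate grids, function names and data layout are assumptions.

```python
# Illustrative sketch: tuning an SVM regression over (C, gamma, epsilon) with
# leave-one-out cross-validation, plus the RMSE / RMSPE / MAPE metrics named
# in the Methods. Candidate grids and input shapes are hypothetical.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV, LeaveOneOut

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def rmspe(y_true, y_pred):
    return np.sqrt(np.mean(((y_true - y_pred) / y_true) ** 2)) * 100

def mape(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

def fit_svr_loocv(X, y):
    """X: monthly feature matrix (search-query indices); y: reported cases."""
    param_grid = {                       # hypothetical candidate values
        "C": [0.5, 1, 2, 4, 8],
        "gamma": [0.001, 0.005, 0.01, 0.05],
        "epsilon": [0.0001, 0.001, 0.01],
    }
    search = GridSearchCV(
        SVR(kernel="rbf"),
        param_grid,
        cv=LeaveOneOut(),                # one held-out month per fold
        scoring="neg_mean_squared_error",
    )
    search.fit(X, y)
    return search.best_estimator_, search.best_params_
```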

          Results

          In total, 17 search queries related to influenza were generated through the initial query selection approach and adopted to construct the SVM regression model: nine queries at the same month, three at a lag of one month, one at a lag of two months and four at a lag of three months. The SVM model performed well with the parameters (C = 2, γ = 0.005, ε = 0.0001) on the ensemble data integrating the influenza surveillance data and the Baidu search query data.
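A minimal sketch of how the reported lag structure could be encoded as model inputs, assuming the monthly Baidu Index series sit in a pandas DataFrame; the column names, hold-out split and helper below are hypothetical, with only the final parameter values (C = 2, γ = 0.005, ε = 0.0001) taken from the Results.

```python
# Illustrative sketch: building lag-0 to lag-3 monthly features from Baidu Index
# series and fitting an SVR with the parameter values reported in the Results.
import pandas as pd
from sklearn.svm import SVR

def build_lagged_features(queries: pd.DataFrame, lags: dict) -> pd.DataFrame:
    """queries: monthly Baidu Index values, one column per search query.
    lags: mapping query name -> lag in months (0-3), e.g. {"flu symptoms": 1}."""
    feats = {f"{name}_lag{lag}": queries[name].shift(lag)
             for name, lag in lags.items()}
    return pd.DataFrame(feats, index=queries.index).dropna()

# Hypothetical usage with a hold-out on the last six months:
# lags = {...}  # the 17 selected queries and their lags
# X = build_lagged_features(baidu_monthly, lags)
# y = influenza_cases.loc[X.index]
# model = SVR(kernel="rbf", C=2, gamma=0.005, epsilon=0.0001).fit(X[:-6], y[:-6])
# predictions = model.predict(X[-6:])
```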

          Conclusions

          The results demonstrated the feasibility of using internet search engine query data as a complementary data source for influenza surveillance and the efficiency of the SVM regression model in tracking influenza epidemics in Liaoning.

          Related collections

          Most cited references (30)


          Using internet searches for influenza surveillance.

          The Internet is an important source of health information. Thus, the frequency of Internet searches may provide information regarding infectious disease activity. As an example, we examined the relationship between searches for influenza and actual influenza occurrence. Using search queries from the Yahoo! search engine (http://search.yahoo.com) from March 2004 through May 2008, we counted daily unique queries originating in the United States that contained influenza-related search terms. Counts were divided by the total number of searches, and the resulting daily fraction of searches was averaged over the week. We estimated linear models, using searches with 1–10-week lead times as explanatory variables to predict the percentage of cultures positive for influenza and deaths attributable to pneumonia and influenza in the United States. With use of the frequency of searches, our models predicted an increase in cultures positive for influenza 1–3 weeks in advance of when they occurred (P < .001), and similar models predicted an increase in mortality attributable to pneumonia and influenza up to 5 weeks in advance (P < .001). Search-term surveillance may provide an additional tool for disease surveillance.
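A hedged sketch of the lead-time regressions described in this abstract: shift the weekly search fraction forward by 1–10 weeks and regress the percentage of positive cultures on each shifted series. The variable names and the use of statsmodels are assumptions, not the authors' implementation.

```python
# Illustrative sketch of lead-time models: regress the percentage of
# influenza-positive cultures on the weekly search fraction shifted forward
# by 1-10 weeks, and compare fits across lead times.
import pandas as pd
import statsmodels.api as sm

def fit_lead_time_models(search_fraction: pd.Series,
                         pct_positive: pd.Series,
                         max_lead: int = 10) -> dict:
    """Both inputs are weekly series on the same DatetimeIndex (hypothetical names)."""
    results = {}
    for lead in range(1, max_lead + 1):
        df = pd.DataFrame({
            "search": search_fraction.shift(lead),  # searches `lead` weeks earlier
            "positive": pct_positive,
        }).dropna()
        model = sm.OLS(df["positive"], sm.add_constant(df["search"])).fit()
        results[lead] = model.rsquared
    return results  # lead time -> R^2, to see which lead explains most variance
```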

            Reassessing Google Flu Trends Data for Detection of Seasonal and Pandemic Influenza: A Comparative Epidemiological Study at Three Geographic Scales

            Introduction Influenza remains a paradox for public health: While influenza epidemics are expected seasonally in temperate climates, their exact timing and severity remain largely unpredictable, making them a challenge to ongoing preparedness, surveillance and response efforts [1]. Surveillance efforts for influenza seek to determine the timing and impact of disease through characterizing information on reported illnesses, hospitalizations, deaths, and circulating influenza viruses [2]. Since establishment of the first computerized disease surveillance network nearly three decades ago [3]–[5], the use of information and communications technology for public health disease monitoring has progressed and expanded. During the last decade, the use of electronic syndromic surveillance systems have allowed for automated, detailed, high volume data collection and analysis in near-real time [6]–[9]. In parallel, novel approaches based on influenza-related internet search queries have been reported to yield faster detection and estimation of the intensity of influenza epidemics [10]–[16]. The public health utility of such systems for prospective monitoring and forecasting of influenza activity, however, remains unclear [17]–[21], particularly as occurred during the 2009 pandemic and the 2012/2013 epidemic season [22]–[24]. In November 2008, Google began prospectively monitoring search engine records using a proprietary computational search term query model called Google Flu Trends (GFT) to estimate national, regional and state level ILI activity in the United States (US) [12]. The goal of GFT was to achieve early detection and accurate estimation of epidemic influenza intensity [13]. The original GFT model was built by fitting linear regression models to weekly counts for each of the 50 million most common search queries, from the billions of individual searches submitted in the US between 2003 and 2007 [13]. An automated query selection process identified the exact text searches that yielded the highest correlations with national and regional influenza-like-illnesses (ILI) surveillance in the US during the period of model fitting; the top scoring 45 search terms constituted the original GFT ILI search definition. The GFT search algorithm was revised in the autumn of 2009, following the emergence and rapid spread of the pandemic A/H1N1pdm09 influenza virus in the US, which had gone wholly undetected by the GFT system. The updated GFT model used surveillance data from the first 20 weeks of the pandemic and a qualitative decision process with less restrictive criteria for additional ILI-related search terms to be included [14]. By September 2009 the historical GFT model was replaced with retrospective estimates from the revised algorithm. Currently, the updated GFT model provides real-time estimates of influenza intensity at three geographic scales in the US: national, state and select local cities, as well as estimates for many countries worldwide [16]. The original and updated GFT models have both shown high retrospective correlation with national and regional ILI disease surveillance data [13], [14]; however, the prospective accuracy of this surveillance tool remains unclear, even though GFT estimates are used in forecasting models for influenza incidence [15], [20], [21]. 
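The exact GFT query-selection algorithm is proprietary and undisclosed; the snippet below is only a schematic stand-in for the correlation-based selection step described above, with hypothetical inputs.

```python
# Illustrative stand-in (the real GFT algorithm is undisclosed): score each
# candidate query series by its correlation with weekly ILI surveillance and
# keep the top-scoring terms as the search definition.
import pandas as pd

def select_top_queries(query_counts: pd.DataFrame, ili: pd.Series, n_top: int = 45) -> list:
    """query_counts: weekly normalized counts, one column per candidate query;
    ili: weekly ILI proportion on the same index (hypothetical inputs)."""
    scores = query_counts.corrwith(ili)             # Pearson correlation per query
    return scores.sort_values(ascending=False).head(n_top).index.tolist()
```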
We present a comparative analysis of traditional public health ILI surveillance data and GFT estimates for ten influenza seasons to assess the retrospective and prospective performances of GFT to capture season-to-season epidemic timing and magnitude. Methods Public Health ILI Surveillance and Internet Search Query Data We compared weekly ILI and GFT data from June 1, 2003 through March 30, 2013, a period of ten influenza seasons which included a range of mild and moderately severe seasonal influenza epidemics as well as the emergence of the first influenza pandemic in over forty years. The surveillance systems were assessed at three geographical levels: entire US, Mid-Atlantic region (New Jersey, New York and Pennsylvania) and New York City. All public health surveillance data used in the study came from systems operating prospectively on a daily or weekly basis throughout the study period [2], [25]–[27]. Nationwide and regional ILI surveillance data were compiled from the US Centers for Disease Control and Prevention (CDC) sentinel ILI-Net surveillance system, which includes sources ranging from small physician practices to large electronic syndromic surveillance networks [2]. The CDC ILI-Net system is publically available each week, typically on Friday for the previous week ending Saturday during the respiratory season (October to May), with a recognized reporting lag of 1–2 weeks [2], [13]. Local ILI data came from the New York City Department of Health and Mental Hygiene (DOHMH) emergency department (ED) syndromic surveillance system, which is collected and analyzed daily, with a reporting lag of about one day [25]–[27]. In each system, all weekly public health surveillance ILI proportions were calculated as total ILI visits divided by all visits each week. Internet search query data came from the original [13] and updated GFT models [14], using weekly estimates available online [16] from both the periods of retrospective model-fitting (4 seasons for the original model and 6 seasons for the updated model) and prospective operation for both models (1 season and 4 seasons, respectively; Table 1). Finalized weekly GFT estimates were publically available each Sunday for the previous week, with a reporting lag of about one day. The original and updated GFT models used scaled measures of ILI-related searches to be directly comparable to the weighted ILI proportions from the CDC ILI-Net system [2], [13], [14], [16] (Figure 1). For additional details on data sources, see Supporting Information. 10.1371/journal.pcbi.1003256.g001 Figure 1 Time-series of weekly influenza-like illness (ILI) surveillance and Google Flu Trends (GFT) search query estimates, June 2003–March 2013. Observed weekly ILI proportions (black lines) are shown with Serfling model baseline (gray lines) and 95% epidemic threshold (dashed lines). The periods of the early wave of the 2009 pandemic and the 2012/2013 epidemic are shaded in grey. Sentinel ILI-Net surveillance is shown for (A) the United States and (B) Mid-Atlantic States (New Jersey, New York, Pennsylvania). Local ILI surveillance from emergency department visits is shown for (C) New York City. Scaled GFT internet search query estimates are shown for model-fitting periods for the original (thin red line) and updated (thin blue line) GFT models, and for the periods of prospective operation of the original (thick red line) and updated (thick blue line) GFT models. 
For Mid-Atlantic States the updated GFT model data represents ILI proportions only for New Jersey and New York (see Supporting Information). 10.1371/journal.pcbi.1003256.t001 Table 1 Retrospective and prospective performance of original and updated Google Flu Trends (GFT) algorithm compared with national (United States), regional (Mid-Atlantic States) and local (New York City) weekly influenza-like illness (ILI) surveillance data, 2003–2013. Time Period and Geographic Location Original GFT modela Updated GFT modelb R2 R2 National Retrospective GFT model-fitting period 0.91 0.94 Prospective GFT model period 0.64 0.73 All study weeks 0.86 0.77 Mid-Atlantic Retrospective GFT model-fitting period 0.79 0.77 Prospective GFT model period 0.27 0.57 All study weeks 0.64 0.64 New York Retrospective GFT model-fitting period 0.89 0.51 Prospective GFT model period 0.03 0.77 All study weeks 0.34 0.41 Performance was evaluated by linear regression of weekly GFT estimates against weekly ILI surveillance. a Original GFT model time periods: The retrospective query selection model-fitting period was from September 28, 2003 through March 17, 2007; the prospective GFT model validation period was from March 18, 2007 through May 17, 2008 and ongoing operation was from May 18, 2008 through Aug 1, 2009. Mid-Atlantic region states included NJ, NY and PA (13). New York comparison was based on NY state GFT estimates (16). b Updated GFT model time periods: the retrospective query selection model-fitting period was from September 28, 2003 through September 18, 2009; The prospective operation period has run from September 19, 2009 through March 30, 2013. Mid-Atlantic region states included only NJ and NY (14). The New York level comparison was based on New York City GFT estimates (16). Measurement of Epidemic Timing and Intensity All observed ILI weekly proportions were analyzed with a traditional Serfling regression approach to establish weekly expected baselines and estimate the “excess” ILI proportions attributable to influenza and identify epidemic periods ([28]–[33]; Supporting Information). The GFT system presents ILI search query estimates as a qualitative measure of influenza activity on a scale ranging from “minimal” to “intense” each week [16]; neither GFT model provided quantitative measure for detection or estimation of impact [13], [14]. For all public health surveillance and GFT estimates we assessed two epidemiological criteria to characterize influenza outbreaks: epidemic timing and intensity. Timing was based on estimates of epidemic onset and peak week for each season and ILI surveillance system. The onset each season was defined as the first of consecutive weeks exceeding the surveillance threshold (upper limit of the 95% confidence interval of the Serfling baseline). The peak week was identified as the week with the greatest proportion of ILI visits each season or epidemic (Table 2). 10.1371/journal.pcbi.1003256.t002 Table 2 Comparison of seasonal and epidemic week of onset and peak weeks as measured by Google Flu Trends (GFT) and public health influenza-like illness (ILI) surveillance data at the national (United States), regional (Mid-Atlantic) and local (New York City) levels. 
Time Period National, United States Regional, Mid-Atlantic States Local, New York City Week of Onset (Peak) ILI Surveillance Difference in Week of Onset (Peak) Original GFT modela Difference in Week of Onset (Peak) Updated GFT modelb Week of Onset (Peak) ILI Surveillance Difference in Week of Onset (Peak) Original GFT modela Difference in Week of Onset (Peak) Updated GFT modelb Week of Onset (Peak) ILI Surveillance Difference in Week of Onset (Peak) Original GFT modela Difference in Week of Onset (Peak) Updated GFT modelb 2003/2004 season 44 (52) +3 (−2) +3 (0) 48 (52) −1 (−1) 0 (0) 46 (52) +1 (−1) +1 (0) 2004/2005 season 51 (6) 0 (0) 0 (+1) 49 (51/6) +1 (+2/0) +1 (+1/+1) 47 (52) +3 (+1) +3 (+1) 2005/2006 season 49 (52/9) +2 (0/0) +2 (0/0) 48 (52/6) +4 (+1/+3) +4 (0/+3) 3 (6) −2 (+3) −3 (+1) 2006/2007 season 50 (52/7) +1 (0/−1) +1 (0/0) 47 (52/7) +4 (+1/+2) +5 (+1/+2) 47 (8) +4 (+1) +11 (0) 2007/2008 season 52 (7) +1 (+1) +3 (+1) 4 (7) −3 (+1) −3 (+1) 44 (7) +9 (+1) +9 (+1) 2008/2009 season 4 (6) −1 (+2) 0 (+1) 4 (8) 0 (−2) −3 (−2) 3 (7) −2 (−1) −2 (0) Spring 2009 pandemic A/H1N1 17 (17) *** 0 (0) 17 (21) *** 0 (+2) 17 (21) +3 (−1) 0 (0) 2009/2010 pandemic season ** (42) NA ** (0) ** (43) NA ** (+1) 34 (47) NA +1 (−3) 2010/2011 season 50 (5) NA +1 (+2) 48 (52/6) NA +3 (+1/+1) 46 (52) NA +4 (+7) 2011/2012 season 8 (11) NA −8 (−1) *** NA *** *** (52) NA *** (+1) 2012/2013 season 47 (52) NA −8 (+3) 48 (52) NA −9 (+3) 49 (3) NA −11 (0) Week of onset was identified as the first of consecutive weeks for each system and region above its Serfling regression 95% threshold, and peaks were identified as the weeks reporting the highest percent-ILI for each season or epidemic. The public health ILI onset and peak weeks are given by surveillance week for each season. The GFT model onset and peak weeks are given relative to the corresponding season/epidemic and regional ILI surveillance weeks. a Original GFT model time periods: The retrospective query selection model-fitting period was from September 28, 2003 through March 17, 2007; the prospective GFT model validation period was from March 18, 2007 through May 17, 2008 and ongoing operation was from May 18, 2008 through Aug 1, 2009. Mid-Atlantic region states included NJ, NY and PA (13). New York comparison was based on NY state GFT estimates (16). b Updated GFT model time periods: the retrospective query selection model-fitting period was from September 28, 2003 through September 18, 2009; The prospective operation period has run from September 19, 2009 through March 30, 2013. Mid-Atlantic region states included only NJ and NY (14). The New York level comparisons was based on New York City GFT estimates (16). ** National and Mid-Atlantic region data remained above threshold at the beginning of the 2009/2010 pandemic season. *** No consecutive weeks above threshold to identify onset or peak during this period. For each data source and season we assessed epidemic intensity by determining the proportion of excess ILI for peak weeks and by summing the weekly excess ILI proportions for each epidemic period as a measure of cumulative ILI intensity for each season and epidemic. All Serfling regression confidence intervals represented the upper and lower 95% limit, calculated as the predicted non-epidemic baseline ±1.96 standard deviations [28]–[33]. We calculated the ratio of excess GFT divided by excess ILI at each geographic level for each epidemic (Table 3), with a constant ratio indicating consistent influenza monitoring by GFT for the period. 
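A minimal sketch of a Serfling-style baseline as described in the Methods above: a harmonic regression on weekly ILI proportions, an epidemic threshold at the upper 95% limit (baseline + 1.96 SD of the residuals), and excess ILI above baseline. The single-pass fit (without iteratively excluding epidemic weeks) and the 52.18-week period are simplifying assumptions.

```python
# Illustrative sketch of a Serfling-style baseline: harmonic regression on
# weekly ILI proportions, a 95% epidemic threshold at baseline + 1.96*SD of
# the residuals, and "excess" ILI as observed minus baseline.
import numpy as np
import statsmodels.api as sm

def serfling_baseline(ili: np.ndarray, week_index: np.ndarray, period: float = 52.18):
    t = week_index.astype(float)
    X = sm.add_constant(np.column_stack([
        t,                                    # linear trend
        np.sin(2 * np.pi * t / period),       # annual harmonic
        np.cos(2 * np.pi * t / period),
    ]))
    fit = sm.OLS(ili, X).fit()
    baseline = fit.predict(X)
    threshold = baseline + 1.96 * np.std(fit.resid)   # upper 95% limit
    excess = np.clip(ili - baseline, 0, None)          # excess ILI above baseline
    return baseline, threshold, excess
```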
10.1371/journal.pcbi.1003256.t003 Table 3 Comparison of epidemic intensity during the 2009 A/H1N1 influenza pandemic and the 2012/2013 seasonal A/H3N2 epidemic as measured by Google Flu Trends (GFT) and public health influenza-like illness (ILI) surveillance at the national (United States), regional (Mid-Atlantic) and local (New York City) levels. Epidemic peak Epidemic intensity as percent over baseline Comparison GFT to ILI surveillance Time Period and Geographic Location ILI% at peak week seasonal excess (95% CI) ratio excess GFT∶ILI ILI surveillance original GFT model updated GFT model ILI surveillance original GFT model updated GFT model prospective (retrospective) National, United States Spring 2009 pandemic A/H1N1 2.7 1.5 2.1 10.3 (6.1–14.5) 0.3 (0.1–0.6) 9.7 (5.5–13.9) 0.03 (0.94) Autumn 2009 pandemic A/H1N1 7.7 NA 7.1 59.2 (51.8–66.5) NA 43.8 (37.9–49.8) 0.74 2009 pandemic A/H1N1, both waves 7.7 NA 7.1 69.4 (57.9–81.0) NA 53.5 (43.4–63.7) 0.77 2012/2013 seasonal A/H3N2 6.1 NA 10.6 27.3 (21.7–32.9) NA 73.2 (63.7–82.6) 2.68 Regional, Mid-Atlantic States Spring 2009 pandemic A/H1N1 4.9 1.4 3.2 27.2 (21.9–32.5) 0.6 (0.03–1.1) 19.2 (15.4–23.0) 0.02 (0.71) Autumn 2009 pandemic A/H1N1 8.3 NA 7 52.1 (42.8–61.3) NA 40.2 (33.5–46.9) 0.77 2009 pandemic A/H1N1, both waves 8.3 NA 7.1 79.3 (64.7–93.8) NA 59.4 (48.9–70.0) 0.75 2012/2013 seasonal A/H3N2 5.7 NA 13 34.3 (27.3–41.4) NA 71.4 (65.9–76.8) 2.08 Local, New York City Spring 2009 pandemic A/H1N1 14.3 1.4 3.1 55.5 (52.2–58.8) 1.3 (0.4–2.1) 15.4 (10.9–19.8) 0.02 (0.28) Autumn 2009 pandemic A/H1N1 4.5 NA 4.4 26.5 (19.0–34.0) NA 24.3 (18.8–29.9) 0.92 2009 pandemic A/H1N1, both waves 14.3 NA 4.4 82.0 (71.2–92.8) NA 39.7 (29.7–49.7) 0.48 2012/2013 seasonal A/H3N2 5.9 NA 12.7 26.3 (21.2–31.42) NA 77.9 (68.2–87.5) 2.96 Epidemic intensity was measured by Serfling regression of weekly percent-ILI for public health surveillance data and GFT estimates for peak week and seasonal epidemic excess, with corresponding upper and lower 95% limit, calculated as the predicted non-epidemic baseline +1.96 standard deviations. Estimating Accuracy of Internet Search Query Data To further evaluate the week-to-week accuracy and timing of GFT and potential asynchrony with traditional ILI surveillance, we calculated Pearson correlations in the national, regional and local datasets, following the original methods used in the development [13] and evaluation of GFT [14]. Original and updated GFT model estimates were assessed for the periods of retrospective model-fitting and prospective monitoring (Table 2), and for specific epidemic seasons (Table 4). We measured cross-correlations at negative and positive lags for each influenza season to identify the corresponding lead or lag with the highest correlation values between GFT and traditional ILI systems, indicating the degree of shift in the timing of the GFT trends compared to ILI surveillance. 10.1371/journal.pcbi.1003256.t004 Table 4 Performance of Google Flu Trends (GFT) relative to public health influenza-like illness (ILI) surveillance at the national (United States), regional (Mid-Atlantic States) and local (New York City) levels for specific epidemic and pandemic seasons. 
Time Period and Geographic Location Original GFT model Updated GFT model R2 R2 ('+/− week lag, max R2) National, United States Influenza seasons 2003–2009 (prior to 2009 pandemic) 0.88 0.92 2009 pandemic A/H1N1 early wave 0.91 0.84 2009/2010 pandemic A/H1N1 season NA 0.98 2010/2011 season NA 0.95 2011/2012 season NA 0.88 2012/2013 season NA 0.90 Regional, Mid-Atlantic States Influenza seasons 2003–2009 (prior to 2009 pandemic) 0.75 0.77 2009 pandemic A/H1N1 early wave 0.51 0.82 2009/2010 pandemic A/H1N1 season NA 0.92 2010/2011 season NA 0.83 2011/2012 season NA 0.37 2012/2013 season NA 0.86 Local, New York City Influenza seasons 2003–2009 (prior to 2009 pandemic) 0.87 0.84 2009 pandemic A/H1N1 early wave 0.78 0.88 2009/2010 pandemic A/H1N1 season NA 0.51 (−3 wks, 0.89) 2010/2011 season NA 0.74 (+1 wk, 0.80) 2011/2012 season NA 0.80 2012/2013 season NA 0.94 While correlations are useful to assess GFT [14], they only provide a measure of relative correspondence between ILI and internet search systems, and do not provide an indication of the nature of the relationship between the trend estimates or the observed lags. As a complementary measure, we compared the regression slope of public health ILI data with GFT estimates during retrospective model-fitting and prospective periods, and for specific seasons. For further details, see Supporting Information. Results During the study period, June 2003 to March 2013, over 4.5 million ILI visits out of 230 million total outpatient sentinel physician visits were reported nationwide to the CDC ILI-Net surveillance network, of which 16.5% were from the Mid-Atlantic surveillance region. In New York City, over 780,000 ILI and 38 million total ED visits were recorded in the DOHMH syndromic surveillance system, with coverage increasing from 88% of all ED visits that occurred citywide during 2003/2004 to >95% of all visits since 2008. The weekly proportion of ILI visits and GFT estimates showed similar seasonal and epidemic patterns across the three regional scales, though with notable differences between retrospective and prospective periods (Figure 1; Table 1). Specifically, during prospective use the original GFT algorithm severely underestimated the early 2009 pandemic wave (shaded 2009 period, Figure 1), and the updated GFT model greatly exaggerated the intensity of the 2012/2013 influenza season (shaded 2012/2013 period, Figure 1). Original GFT Model, 2003–2009 Prior to the Pandemic Historical estimates from the original GFT model were based on the model-fitting period from September 28, 2003 to March 17, 2007; the system was evaluated during March 18, 2007 to May 11, 2008, and has run prospectively since then. The week-to-week GFT estimates during the model-fitting period were highly correlated with ILI surveillance data at the national (R2 = 0.91), regional (Mid-Atlantic, R2 = 0.79) and state/local level (New York, R2 = 0.89; Table 1). Similarly, GFT estimates were highly correlated with CDC ILI surveillance at the national and regional levels during the validation period [13], and remained high through the period of prospective use prior to the emergence of the 2009 A/H1N1 pandemic, from May 12, 2008 to March 28, 2009 (R2≥0.75; Table 4). Seasonal and epidemic onset and peak weeks varied considerably during the period (Table 2). Estimation of excess ILI visits and GFT search query fractions were also well correlated on a week to week basis during this period (Supporting Tables; Figure 2). 
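A small sketch of the lagged-correlation check described in the Methods: shift the GFT series across a window of weekly lags and keep the lag with the highest correlation against ILI surveillance. The sign convention and window size are assumptions.

```python
# Illustrative sketch: cross-correlate GFT estimates against ILI surveillance
# at weekly lags of -max_lag..+max_lag and report the best-matching lag.
import pandas as pd

def best_lag(gft: pd.Series, ili: pd.Series, max_lag: int = 4):
    """Weekly series on a shared index; positive lag = GFT leads ILI (assumption)."""
    corrs = {lag: gft.shift(lag).corr(ili) for lag in range(-max_lag, max_lag + 1)}
    corrs = {lag: c for lag, c in corrs.items() if pd.notna(c)}
    lag = max(corrs, key=lambda k: corrs[k])
    return lag, corrs[lag]
```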
10.1371/journal.pcbi.1003256.g002 Figure 2 Scatter plots of weekly excess influenza-like illness (ILI) visit proportions against original Google Flu Trends (GFT) model search query estimates, 2003–2009. Weekly excess percent-ILI is calculated as Serfling estimates subtracted from observed proportions. Plots show original GFT model estimates compared with weighted CDC ILI-Net data for (A) the United States, and (B) Mid-Atlantic Census Region States (New Jersey, New York, Pennsylvania), and local ILI surveillance from emergency department visits for (C) New York City. Plots are shown for pre-pandemic influenza seasons, June 1, 2003 to April 25, 2009 (grey circles) and the early wave of the A/H1N1 pandemic, April 26 to August 1, 2009 (red diamonds). Lines representing equivalent axes for X = Y are shown (grey dashed line). Regression lines are shown for seasonal influenza 2003–2009 (black line) and the early 2009 wave of the pandemic (red line). Original GFT Model during the First Wave of the 2009 Pandemic In late-April 2009, detection of novel A/H1N1 influenza in an outbreak in Queens, New York, was immediately followed by a spike in ILI surveillance data across much of the nation during the week ending May 2, 2009 [2]. Mid-Atlantic States and New York City experienced a substantial spring pandemic wave (Figure 1B,C), unlike many other regions of the US [2]. Despite recognized pandemic activity, the national GFT estimates were below baseline ILI levels for May–August 2009, indicating no excess impact (red line, shaded 2009 period, Figure 1A). The correlations between the surveillance ILI and GFT estimates, however, were very high during this period at the US level for observed (R2 = 0.91) as well as estimated excess values (R2 = 0.81; Figure 2A). At the Mid-Atlantic level, correlations were lower for observed (R2 = 0.51), but still high for estimated excess values (R2 = 0.80), while the slope of the linear relationship between the two surveillance systems was near zero (slope = 0.11), indicating that there was little or no excess ILI estimated by GFT (Figure 2B). The discrepancy at the Mid-Atlantic level was exacerbated for New York City, where the pandemic impact was greater than any other epidemic that decade, while the original GFT estimates remained near expected baseline levels for the entire period (R2 = 0.78). Accordingly, the slope of the GFT regression against ILI was near zero (slope = 0.05), indicating that GFT data did not accurately measure the intensity of the pandemic (Figure 2C). Taken together, the original GFT model missed the spring 2009 pandemic wave at all levels (Figure 1), providing incidence estimates 30–40 fold lower than those based on ILI surveillance (Table 3). Updated GFT Model, Retrospective Period 2003–2009 The original and updated GFT estimates appeared very similar during the pre-pandemic period 2003–2009, but diverged considerably by May 2009 (red and blue lines, Figure 1). Like the original GFT model, the updated GFT estimates during the model-fitting period were highly correlated with CDC ILI surveillance at the national and regional levels (R2≥0.77, Table 1). In contrast for New York City, the updated GFT estimates were less well correlated with local ILI syndromic surveillance data during this period (R2 = 0.51, Table 1). 
Of particular interest is the retrospective characterization of the 2009 pandemic by the updated GFT algorithm, which tracked the spring wave very well at the national level, but underestimated the magnitude at the regional level by nearly 30%, and at the New York City level by 70% (Figure 1; Table 3). Updated GFT Model Ability to Track the Fall 2009 Pandemic In September 2009, the updated GFT algorithm began running prospectively, providing estimates that tracked CDC ILI surveillance data well for the remainder of 2009, a period in which most pandemic A/H1N1 infections occurred. Updated GFT estimates were highly correlated with ILI surveillance at the national (R2 = 0.98), and regional (R2 = 0.92) levels (Figure 1A–B; Table 4). Mid-Atlantic ILI surveillance, however, demonstrated two peaks, consistent with different timing of pandemic waves in states within the region (Figure 1B). For New York City, the updated GFT estimates and ILI surveillance were less well correlated when measured directly (R2 = 0.51), though highly correlated when lagged by three weeks (R2 = 0.89), showing the updated GFT model estimates for the fall 2009 pandemic wave to increase and peak 3 weeks earlier than ILI surveillance (Figure 1C; Table 4). Overall, GFT underestimated the cumulative ILI incidence of the main pandemic period, May–December 2009, by 52% for New York City (25% for the broader region), with non-overlapping confidence intervals between the GFT and ILI surveillance systems (Table 3). Updated GFT Model Performance during 2010–2012 Correlations between the updated GFT model and ILI data during the first two years of prospective post-pandemic surveillance were high at the national level during the 2010/2011 (R2 = 0.95) and 2011/2012 (R2 = 0.88) seasons (Table 4). At the regional level, there was high correlation in 2010/2011 (R2 = 0.83) with a slight underestimation of incidence, and low correlation in 2011/2012 (R2 = 0.37) with a slight overestimation of ILI incidence (Figure 1B). At the New York City level, updated GFT estimates for 2010/2011 were reasonably well correlated with observed ILI (R2 = 0.74), though with ILI surveillance increasing and peaking earlier (Figure 1C), and showing an improved lagged correlation (R2 = 0.80, lagged 1 week; Table 4). Updated GFT Model Performance during the 2012/2013 Season For the relatively early and moderately severe 2012/2013 epidemic season, observed GFT estimates greatly overestimated the initial onset week and magnitude of the outbreak at all three geographical levels (Figure 1; Table 2). The correlations between the updated GFT model estimates and ILI surveillance, however, were very high at all levels (R2≥0.86, Table 4). GFT model estimates of epidemic intensity were far greater than ILI surveillance data at the national (268%), regional (208%) and local (296%) levels (Table 3). Accordingly, the slopes of the weekly regression of ILI surveillance against GFT estimates during 2012/2013 (United States, slope = 1.91; Mid-Atlantic, slope = 2.29; New York City, slope = 2.63) were far greater than those for other epidemic and pandemic seasons (Figure 3), and substantially different from a slope of 1 (p<0.05). 10.1371/journal.pcbi.1003256.g003 Figure 3 Scatter plots of weekly excess influenza-like illness (ILI) visit proportions against updated Google Flu Trends (GFT) model search query estimates, 2003–2013. Weekly excess percent-ILI is calculated as Serfling estimates subtracted from observed proportions. 
Plots show updated GFT model estimates compared with weighted CDC ILI-Net data for (A) the United States, and (B) Mid-Atlantic HHS-2 Region States (New Jersey, New York), and local ILI surveillance from emergency department ILI visit data for (C) New York City. Plots are shown for weeks June 1, 2003 to April 25, 2009 (grey circles), April 26 to January 2, 2010 (red diamonds), January 3, 2010 to Oct 6, 2012 (grey squares), and October 7, 2012 to March 30, 2013 (blue triangles). Lines representing equivalent axes for X = Y are shown (grey dashed line). Regression lines are shown for the 2003/2004–2008/2009 seasons (black line), 2009 pandemic (red line), 2010/2011–2010/2012 seasons (grey solid line) and the 2012/2013 season (blue line). Discussion Following Google's development of GFT in 2008, and the considerable excitement generated by the original publication and release of the online tool [12], [13], [16], concerns were raised regarding the tenuous relationship between internet searches and the presentation of illness to clinical or emergency medical providers [17]. We used clinical ILI surveillance data at local, regional and national scales as a proposed “ground truth” to test the ability of GFT to perform as a timely and accurate surveillance system in the US. We identified substantial errors in GFT estimates of influenza timing and intensity in the face of pandemic and seasonal outbreaks, including prospectively missing the early wave of the 2009 pandemic and overestimating the impact of the 2012/2013 epidemic. Although we are not the first to report issues in GFT estimates for seasonal and pandemic influenza [22], our study is the first to carefully quantify the performance of this innovative system over a full decade of influenza activity and across three geographical scales. The 2009 A/H1N1 pandemic is a particularly important case study to test the performance of GFT, with its unusual signature pandemic features of out-of-season activity in the spring of 2009, atypical (young) age pattern of cases, recurring waves and substantial geographic heterogeneity [34]–[38]. Immediately following the spread of the pandemic virus in the US, public health officials and electronic surveillance networks found that local and state level surveillance data did not correspond with estimates provided by the original GFT model, particularly in some urban areas and harder hit regions of the Northeastern and Midwestern US [18], [39]. Clearly, the original GFT algorithm was not able to track sentinel ILI patterns that deviated from the influenza seasons that occurred during the model-fitting period. Even after the GFT algorithm was revised in September 2009, we have shown that the retrospective estimates for the spring 2009 pandemic wave were still not in agreement with regional and local surveillance. Further, the updated GFT model that has been used prospectively failed to accurately capture the autumn 2009 pandemic wave in New York City, presenting the outbreak three weeks before it actually occurred. This assessment echoes earlier concerns regarding the timeliness and accuracy of internet search data for public health monitoring at the local level [17] and during the early wave of the 2009 pandemic [18]. To have missed the early wave of the 2009 pandemic is a serious shortcoming of a surveillance system – as these are times when accurate data are most critically needed for pandemic preparedness and response purposes. 
Although the GFT system provided relatively accurate estimates during post-pandemic years which were characterized by mild influenza activity, it overestimated the 2012/2013 epidemic by 2–3 fold relative to traditional ILI surveillance systems, across national, regional and local geographical levels in the US (see also [22]). While the intensity of the 2012/2013 influenza season was roughly comparable to the 2003 A/H3N2-Fujian epidemic as measured by traditional surveillance and assessed by CDC as “moderately severe” [2], the 2012/2013 season was scored by the GFT tool as by far the most severe epidemic in over a decade. A limitation of our study is its focus on US systems. Many international syndromic, physician consultation, laboratory and internet survey surveillance systems provide rapid, detailed and accurate influenza-related surveillance [3]–[5], [40]–[48]. These systems allowed for development of GFT search query algorithms which were trained to mimic the specific regional influenza-related patterns [16]. While international GFT search query estimates are publically available earlier than many government run surveillance systems, it is important to note that public health data typically undergo monitoring for data quality and investigation prior to public release. It is also important to note that GFT has been set up where robust surveillance systems already exist, providing ILI search query data for populations that are already under surveillance. An additional limitation of our study is the imperfect nature of our assumed “ground truth” surveillance. Our study sought to assess the ability of GFT to estimate physician consultation and syndromic ILI surveillance patterns, not necessarily the true incidence of influenza infection and illness. We recognize that physician sentinel and syndromic data can be biased, particularly during periods of heightened public health concern. This has been well described in a study of online survey data and health-seeking behavior during the two waves of the 2009 pandemic in England [48]. This recognized bias highlights the need for multiple sources of surveillance information in the community. In a previous evaluation of GFT, the authors and engineers at Google and the US CDC concluded that their original GFT model had “performed well prior to and during” the 2009 pandemic, when assessed as simple correlations at national and regional levels [14]. Regarding this measure of performance, however, we found the use of simple correlation to be inadequate, as values greater than 0.90 often occurred during periods when critical metrics such as peak magnitude and cumulative ILI revealed that the GFT models were actually greatly under- or over-estimating influenza activity. Our study demonstrates that simple correlation measures can mischaracterize the performances of a novel surveillance system, and instead we recommend the use of additional and alternative metrics based on estimates of onset and peak timing and cumulative intensity of influenza epidemics. Because the search algorithm and resulting query terms that were used to define the original and updated GFT models remain undisclosed, [13], [14], it is difficult to identify the reasons for the suboptimal performance of the system and make recommendations for improvement. Concerns were raised early-on that the data-mining nature of GFT might over-fit the historical data and introduce bias in prospective use [17]. 
After the original GFT model missed the spring 2009 pandemic wave – an outbreak with different timing and characteristics than the outbreaks present in the retrospective model-fitting period – the GFT algorithm was modified, potentially addressing the possible over-fitting issue. The revised GFT model, however, appeared to be susceptible to bias in the opposite direction, possibly due to changes in health information searching and care seeking behavior driven by the media. Further, important epidemiologic information such as patient age, location, illness complaint or clinical presentation remain un-available in GFT (an adult person could be performing a search on behalf of a sick minor in another state). In contrast, public health information systems are less prone to such biases, as they collect demographic and geographic data as well as additional health outcomes, which can be used to investigate atypical signals. Ultimately, public health actions are taken locally. As such, the accuracy and timeliness of local disease surveillance systems are critical; as is the utility of the information in supporting decisions. The additional detail in local syndromic ILI surveillance data, and its direct link to individuals seeking care, facilitates public health action. Computerized surveillance, such as the New York City syndromic chief complaint ED system, can accurately capture the impact of influenza activity [25], [26]. In the present study, we have shown that these systems are more accurate than, yet equally timely as the GFT tool, which indicates the need for further research and support for computerized local disease surveillance systems. We believe there is a place for internet search query monitoring in disease surveillance, and for continued research and development in this area [13]–[21], [49]–[58]. For now, in the US CDC's national and regional ILI surveillance data remain the “ground truth” source of influenza activity at national and regional levels, but timeliness, detail and coverage remain issues. Thus, we believe there is a broader need for electronic clinically-based disease surveillance at the local level, similar to the ED system in place in New York City [25]–[27], and for collaborative and distributed networks connecting these systems for research and practice [39], [58]–[60]. Careful evaluation of the strengths and limitations of GFT and other innovative surveillance tools should be expanded to encompass a range of developed and developing country settings, following the approach proposed here, in order to improve local, regional and global outbreak surveillance methods and inform public health responses. The way forward using high volume search query data such as GFT may be through integration of near-real time electronic public health surveillance data, improved computational methods and disease modeling – creating systems that are more transparent and collaborative, as well as more rigorous and accurate, so as to ultimately make them of greater utility for public health decision making. Supporting Information Figure S1 National level influenza season observed and model baseline data, 2003–2013. (PDF) Click here for additional data file. Figure S2 Comparison of national level influenza-like illness (ILI) surveillance and Google Flu Trends (GFT) original and updated models. (PDF) Click here for additional data file. Figure S3 Mid-Atlantic state seasonal observed and model baseline, 2003–2013. (PDF) Click here for additional data file. 
Figure S4 Comparison of Mid-Atlantic state influenza-like illness (ILI) surveillance and Google Flu Trends (GFT) original and updated models. (PDF)
Figure S5 New York influenza season observed and model baseline data, 2003–2013. (PDF)
Figure S6 Comparison of New York State and New York City Google Flu Trends (GFT) updated models. (PDF)
Figure S7 Comparison of New York City emergency department (ED) influenza-like illness (ILI) syndrome surveillance and New York City and State Google Flu Trends (GFT) original and updated models. (PDF)
Table S1 Influenza epidemic season intensity, national level in the United States, 2003–2013. (PDF)
Table S2 Google Flu Trends (GFT) model correlation, national level in United States, 2003–2013. (PDF)
Table S3 Influenza season epidemic intensity in Mid-Atlantic States, 2003–2013. (PDF)
Table S4 Google Flu Trends (GFT) model correlation, Mid-Atlantic States, 2003–2013. (PDF)
Table S5 Influenza season epidemic intensity in New York, 2003–2013. (PDF)
Table S6 Google Flu Trends (GFT) model correlation, New York, 2003–2013. (PDF)
Text S1 Technical appendix: Supplementary methods and results. (PDF)

              Wikipedia Usage Estimates Prevalence of Influenza-Like Illness in the United States in Near Real-Time

              Introduction Each year, there are an estimated 250,000–500,000 deaths worldwide that are attributed to seasonal influenza [1], with anywhere between 3,000–50,000 deaths occurring in the United States of America (US) [2]. In the US, the Centers for Disease Control and Prevention (CDC) continuously monitors the level of influenza-like illness (ILI) circulating in the population by gathering information from sentinel programs which include virologic data as well as clinical data, such as physicians who report on the percentage of patients seen who are exhibiting influenza-like illness [2]. While the CDC ILI data is considered to be a useful indicator of influenza activity, its availability has a known lag-time of between 7–14 days, meaning that by the time the data is available, the information is already 1–2 weeks old. To appropriately distribute vaccines, staff, and other healthcare commodities, it is critical to have up-to-date information about the prevalence of ILI in a population. There have been several attempts at gathering non-traditional, digital information to be used to predict the current or future levels of ILI, and other diseases, in a population [3]–[11]. The most notable of these attempts to date has been Google Flu Trends (GFT), a proprietary system designed by Google, which uses Google search terms that are correlated with ILI activity in the US to make a estimation of the current level of ILI [12]. Google Flu Trends was initially quite successful in its estimation of ILI activity, but was shown to falter in the face of the 2009 H1N1 swine influenza pandemic (pH1N1) due to much-increased levels of media attention surrounding the pandemic [13]. Similarly, GFT greatly over-estimated ILI activity in the 2012–2013 influenza season, again likely due to that fact that it was a more severe influenza season than normally observed and therefore garnered much media attention [14]. In the face of these obstacles, Google has continued to update and re-evaluate its models [15]–[17]. Although GFT has performed well in the past, with the exception of two high ILI activity time periods, new methods of estimating current ILI activity that are less susceptible to error in the face of media coverage should be sought. Additionally, as the global community continues to become increasingly in favor of open-access data and methods [18], new methods of ILI estimation should be freely available for everyone to investigate and improve upon, unlike GFT, which does not share the search terms it uses in its algorithms (though results are public). To this end, we have created a method of estimating current ILI activity in the US by gathering information on the number of times particular Wikipedia articles have been viewed. Wikipedia is a massive, user-regulated, online encyclopedia. Launched in 2001, Wikipedia harnesses the power of the online community to create, edit, and modify encyclopedia-like articles that are then freely available to the entire world. Currently operating in 232 languages, Wikipedia has ∼30 million articles available, expanding at approximately 17,800 articles per day, with nearly 506 million visitors per month, representing 27 billion total page views since its launch, and has approximately 31,000 active Wikipedia editors (http://stats.wikimedia.org) [19]. 
With a wealth of detailed information on an almost limitless range of topics, Wikipedia is ideally suited as a platform that could potentially be of use for legitimate scientific investigation in many different areas. Not only is the information held within Wikipedia articles very useful on its own, but statistics and trends surrounding the amount of usage of particular articles, frequency of article edits, region specific statistics, and countless other factors make the Wikipedia environment an area of interest for researchers. It has previously been shown that Wikipedia can be a useful tool to monitor the emergence of breaking news stories, to track what topics are “trending” in the public sphere, and to develop tools for natural language processing [20]–[23]. Furthermore, Wikipedia makes all of this information public and freely available, greatly increasing and expediting any potential research studies that aim to make use of their data. The purpose of this study was to develop a statistical model to provide near real-time estimates of ILI activity in the US using freely available data gathered from the online encyclopedia, Wikipedia. Methods Wikipedia Articles of Consideration In an attempt to use Wikipedia data to estimate ILI activity in the US, we compiled a list of Wikipedia articles that were likely to be related to influenza, influenza-like activity, or to health in general. These articles were selected based on previous knowledge of the subject area, previously published materials, and expert opinion. In addition to articles that were potentially related to ILI activity, several articles were selected to act as markers for general background-level activity of normal usage of Wikipedia. For example, information was gathered on the number of times the Wikipedia main page (www.en.wikipedia.org/wiki/Main_page) was accessed per day, as a measure of normal website traffic. As well, the Wikipedia article for the European Centers for Disease Control was included in models in an attempt to control for non-American article views. Table 1 displays the Wikipedia articles that were considered for inclusion in our models. 10.1371/journal.pcbi.1003581.t001 Table 1 List of Wikipedia articles selected for investigation for inclusion in ILI estimation models. Avian influenza* Influenza Virus B* Centers for Disease Control and Prevention* Influenza Virus C* Common Cold* Influenza Virus Subtype H1N1 Epidemic* Influenza Virus Subtype H2N2* European Centers for Disease Control and Prevention Influenza Virus Subtype H2N9* Fever* Influenza Virus Subtype H3N1* Flu Season* Influenza Virus Subtype H3N2* Human Influenza* Influenza Virus Subtype H5N1* Influenza Influenza Virus Subtype H5N2* Influenza-like Illness* Oseltamivir* Influenza Pandemic Pandemic Influenza Research* Swine Influenza Influenza Treatment* Tamiflu* Influenza Vaccine* Vaccine Influenza Virus* Wikipedia Main Page Influenza Virus A* 1918 Flu Pandemic* *Only terms with an asterisk were included in the Lasso regression model. Wikipedia article view information is made freely available by Wikipedia, under a project called Wikimedia Statistics (http://en.wikipedia.org/wiki/Wikipedia:Statistics), and is available as the number of article views per hour, which may include multiple views on the same article by the same user. 
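A small sketch of how the hourly per-article view counts mentioned above could be rolled up to the Sunday-start weekly totals used in the data collection described in the next paragraph; the input layout is assumed.

```python
# Illustrative sketch: aggregate Wikipedia's hourly per-article view counts to
# daily totals and then to weeks beginning on Sunday (ending Saturday).
import pandas as pd

def weekly_views(hourly: pd.DataFrame) -> pd.DataFrame:
    """hourly: view counts indexed by timestamp, one column per article (assumed)."""
    daily = hourly.resample("D").sum()
    return daily.resample("W-SAT").sum()   # weeks ending Saturday = starting Sunday
```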
A freely available, user-written tool was independently developed to more easily access the information that Wikipedia makes available (http://stats.grok.se), which aggregates article view data to the day-level, and this tool was used to gather total daily article view information. Daily Wikipedia article view data was retrospectively collected beginning at the earliest available date, December 10, 2007, through to August 19th, 2013, and then aggregated to the week level, with each week beginning on Sunday. CDC and GFT Data The CDC compiles data on the weekly level of ILI activity in the United States by collecting information from sentinel sites across the country where physicians report on the number of patients with influenza-like illness. CDC ILI data is freely available through ILInet, via the online FluView tool (www.cdc.gov/flu/weekly), and downloadable as week-level data. Google Flu Trends data is also freely available through the Google Flu Trends website (http://www.google.org/flutrends) and is provided weekly at the country and state level. GFT data is the result of Google's proprietary algorithm that uses Google search queries to estimate the level of ILI activity in a given region. Data Collection We gathered Wikipedia article view data beginning from the week of December 10th, 2007, the earliest records available, until August 19th, 2013. Accordingly, retrospective CDC ILI data and GFT data was obtained for the same period as the Wikipedia article view information, although both the CDC and GFT data extends much further back in time. When aggregated to week-level, all data sources accounted for 296 weeks of retrospective information, capturing five full influenza seasons as well as partial 2007–2008 data. Due to a lapse in the Wikipedia database, article view information is not available between July 13th and July 31st, 2008, inclusive. Therefore, the total set of data available accounts for 294 weeks. Influenza-Like Illness Modeling Models to estimate ILI activity using Wikipedia article view information were developed using a generalized linear model framework. The outcome variable, age-weighted CDC ILI activity, is a proportion and is therefore appropriately modeled using a Poisson distribution, and so the Poisson family was used in the GLM framework, with a log-link function. In an attempt to adjust for potential over-fitting, models were run using jackknife resampling. Two principle models were created, which include Mf, a Poisson model that used the full set of collected Wikipedia article page view data, and Ml, a Poisson model that used Lasso (Least Absolute Shrinkage and Selection Operator) regression analysis. Lasso regression dynamically and automatically selects predictor variables for inclusion or exclusion by penalizing the absolute size of the regression coefficients toward zero, thereby selecting a subset of predictor variables which best describe the outcome data [24], [25]. To investigate the reliability of the models, we used a split-sample analysis on the Ml models to compare how well the Lasso selected predictors for a subset of the data (including years 2007, 2008, 2009, and 2010) accounted for the observed data in the remaining subset (years 2011, 2012, and 2013). 
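A hedged sketch of the modeling step just described: a Poisson GLM with log link relating weekly ILI activity to the Wikipedia predictors, plus an L1-penalized fit as a stand-in for the Lasso selection behind the Ml model. The authors worked in Stata, so the statsmodels calls, penalty weight and input names here are assumptions.

```python
# Illustrative sketch: Poisson GLM (log link) for weekly CDC ILI activity on
# Wikipedia article views, with an L1-penalized fit as a stand-in for the
# Lasso variable-selection step (inputs and alpha are hypothetical).
import pandas as pd
import statsmodels.api as sm

def fit_ili_models(views: pd.DataFrame, ili: pd.Series, alpha: float = 0.01):
    """views: weekly article-view predictors (plus year/month dummies);
    ili: weekly ILI activity on the same index."""
    X = sm.add_constant(views)
    full = sm.GLM(ili, X, family=sm.families.Poisson()).fit()            # "Mf"-style
    lasso = sm.GLM(ili, X, family=sm.families.Poisson()).fit_regularized(
        alpha=alpha, L1_wt=1.0)                                          # "Ml"-style
    selected = [c for c, b in zip(X.columns, lasso.params) if abs(b) > 1e-8]
    return full, selected
```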
Additionally, each of these aforementioned models were also run while excluding data at key time periods which reflect higher than normal ILI activity or Wikipedia article view traffic (during the early weeks of the 2009 pandemic H1N1 swine influenza pandemic and the unusually severe influenza season of 2012–2013) as a means of investigating the models' ability to deal with large data spikes. By comparing the models with or without higher than normal Wikipedia usage, we can investigate what impact, if any, spikes in Wikipedia activity (potentially caused by increased media reporting of influenza-related events) have on the accuracy of the models, and whether or not these spikes in traffic need to be accounted for. In addition to a factor variable representing the year being included in the models, the month was also controlled for in an effort to adjust for the seasonal patterns that influenza outbreaks exhibit in the United States. All models were investigated for appropriate fit using the Pregibon's goodness-of-link test [26] and by examining Anscombe and deviance residuals. Models were compared to one another by comparing Akaike's Information Criteria, response statistics, and by performing likelihood-ratio tests on the maximum-likelihood values of each model. Goodness-of-fit (GOF) tests, both Pearson and deviance, were tested for; all presented models had GOFs≫0.05. All statistics and models were performed using Stata 12 (Statacorp., College Station, Texas, US). Results Across the 294 weeks of data available, the number of views of each Wikipedia article under consideration showed large variability. As an example of this variation, the mean number of daily views of the “Influenza” article was 30,823, but the total number of views ranged from 3,001–334,016 per day. Some of the articles under investigation had relatively few views, such as “influenza-like illness” with a mean of 1,061 article views per day (range: 0–15,629 views per day), while others had very high numbers of views per day, such as the Wikipedia Main Page, which had a mean of 44 million views per day (range: 7–139 million views per day). Herein, we will discuss the characteristics of several models in an attempt to use Wikipedia article view information to estimate nationwide ILI activity based on CDC data. We consider a full model (Mf) that includes all dependent variables that were investigated and a Lasso-selected model (Ml) that includes only dependent variables chosen as significant by the Lasso regression method. Full-Data Models The Mf model, containing all 35 predictor variables (including year, month, CDC page views, ECDC page views, and Wikipedia Main Page views) and 294 weeks of data, resulted in a Poisson model with an AIC value of 2.795. Deviance residuals for this model ranged from −0.971–1.062 (mean: −0.006) and were approximately normally distributed. Although many of the dependent variables showed spikes in page view activity around the beginning of the 2009 pH1N1 event, the Mf model was able to accurately estimate the rate of ILI activity, with a mean response value (difference between observed and estimated ILI values) of 0.48% in 2009 between weeks 17–20, inclusive. Overall, the absolute response values for the Mf model ranged from 0.00–2.38% (mean: 0.27%, median: 0.16%). In comparison, the absolute response values between CDC ILI data and GFT data ranged from 0.00–6.04% (mean: 0.42%, median: 0.21%). 
The Pearson correlation coefficient between the CDC ILI values and the estimated values from the Mf model was 0.946 (p<0.001). The observed range of ILI activity throughout the entire period for which data are available, as reported by the CDC, was 0.47–7.72%, with a median value of 1.40%. In comparison, the Mf model's estimated ILI activity for the same period ranged from 0.44–8.37%, with a median value of 1.50%, and the GFT ILI data ranged from 0.60–10.56%, with a median value of 1.72%.

The Ml model, which contained 26 variables (including year, month, and CDC page views) chosen as significant by the Lasso regression method, resulted in a model with an AIC of 2.764. Deviance residuals for this model ranged from −0.790 to 1.205 (mean: −0.007) and were approximately normally distributed, though less so than in Mf. The absolute response values for the Ml model ranged from 0.00–2.53% (mean: 0.29%, median: 0.18%). During weeks 17–20 of the 2009 pH1N1 event, the mean response value for this model was 0.45%, suggesting it was slightly less accurate than the Mf model over this period of unusually high article view activity. The Pearson correlation coefficient between CDC ILI data and the estimated values from the Ml model was 0.938 (p<0.001), and the range of estimated ILI values for this model was 0.55–8.66%, with a median value of 1.48%.

Split-sample analysis was used to investigate the reliability of the Ml model. A Lasso regression model was trained on data from years 2007–2010, inclusive, and the selected predictor variables were then used to estimate the ILI activity for each week in the remainder of the dataset (years 2011–2013, inclusive). The cross-validation Pearson correlation between the actual observed CDC ILI data and the ILI estimates provided by the Ml model based on the first subset of data was 0.9854 (p<0.001).

Figure 1 shows the time series for CDC ILI data, GFT data, and the estimated ILI values from both the Mf and Ml models.
Figure 1. Time series plot of CDC ILI data versus estimated ILI data (doi: 10.1371/journal.pcbi.1003581.g001). (A) The Wikipedia full model (Mf) accurately estimated 3 of 6 ILI activity peaks and had a mean absolute difference of 0.27% compared with CDC ILI data. (B) The Wikipedia Lasso model (Ml) accurately estimated 2 of 6 ILI activity peaks and had a mean absolute difference of 0.29% compared with CDC ILI data. (C) The Google Flu Trends (GFT) model accurately estimated 2 of 6 ILI activity peaks and had a mean absolute difference of 0.42% compared with CDC ILI data.

Models without Peak Activity
In the following models, data from the beginning weeks of the 2009 pH1N1 event (weeks 17–20, inclusive), which showed large spikes in Wikipedia article views due to increased media attention, were excluded from the analyses. Because of the higher-than-normal influenza activity of the 2012–2013 influenza season, those data were also removed, from week 40 of 2012 to week 13 of 2013, inclusive. By running the Poisson models without these high-volume time sections, comparisons can be made to the full models in order to investigate the models' estimating ability in the face of higher-than-normal levels of influenza activity or Wikipedia article views. When the above-mentioned data were removed, the Mf model produced an AIC value of 2.772, only marginally smaller than that of the complete Mf model, and comprised 263 weeks of data.
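The split-sample check can be approximated as follows: Lasso-select predictors on the 2007–2010 weeks, refit a Poisson GLM on the selected predictors, and correlate the out-of-sample estimates with observed CDC ILI for 2011–2013. The Python sketch below is only an approximation of the procedure described above (the original selection was run within the authors' Stata workflow), and all file and column names are placeholders.

```python
# Rough sketch of the split-sample reliability check described in the text.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
from scipy.stats import pearsonr

weekly = pd.read_csv("weekly_ili_and_views.csv", parse_dates=["week"])
train = weekly[weekly["week"].dt.year <= 2010]
test = weekly[weekly["week"].dt.year >= 2011]

view_cols = [c for c in weekly.columns if c.endswith("_views")]
scaler = StandardScaler().fit(train[view_cols])

# Lasso on log(ILI), used here purely as a variable-selection step.
lasso = LassoCV(cv=5).fit(scaler.transform(train[view_cols]), np.log(train["ili"]))
selected = [c for c, b in zip(view_cols, lasso.coef_) if b != 0.0]

# Refit a Poisson GLM on the selected predictors and estimate the held-out weeks.
glm = sm.GLM(train["ili"], sm.add_constant(train[selected]),
             family=sm.families.Poisson()).fit()
estimates = glm.predict(sm.add_constant(test[selected]))

r, p = pearsonr(test["ili"], estimates)
print(f"out-of-sample Pearson r = {r:.3f} (p = {p:.2g})")
```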
The range of deviance residuals from this model, −0.650 to 0.891, is slightly narrower than that of the complete Mf model, suggesting a better fit. For the truncated Lasso model, the Poisson regression model was refit to include only the available data, and therefore produced a different set of 24 predictor variables. This model yielded an AIC value of 2.727, with a range of deviance residuals from −0.677 to 1.081, a marginal narrowing over the original Ml model. Pearson correlation coefficients between CDC ILI data and the values estimated by the Mf and Ml models, for the peak-truncated data, were 0.958 (p<0.001) and 0.942 (p<0.001), respectively.

Peak Influenza-Like Illness Estimation
In the United States, seasonal influenza activity usually peaks during January or February. Using the maximum value of the CDC ILI data in a single influenza season as the true peak time and value, we compared the peak value and week of influenza activity as estimated by our two models, Mf and Ml, as well as by the Google Flu Trends data. Results are summarized by model and by year in Table 2.

Table 2. Comparisons of CDC, Mf, Ml, and GFT peak ILI values (doi: 10.1371/journal.pcbi.1003581.t002).

Season       Source     Year  Week  ILI value  Referent CDC ILI value*  % difference from CDC  Peak agrees with CDC
2007–2008    CDC peak   2008    7     5.98
             Mf peak    2008    8     4.94          5.62                      0.68                    N
             Ml peak    2008    7     4.43          5.98                     −1.55                    Y
             GFT peak   2008    8     5.81          5.62                      0.19                    N
2008–2009    CDC peak   2009    7     3.57
             Mf peak    2009   12     3.48          2.43                     −1.05                    N
             Ml peak    2009   12     3.33          2.43                      0.90                    N
             GFT peak   2009    8     3.50          3.37                      0.13                    N
2009–2010    CDC peak   2009   43     7.72
             Mf peak    2009   43     8.36          7.72                     −0.64                    Y
             Ml peak    2009   44     8.66          7.55                      1.11                    N
             GFT peak   2009   43     7.11          7.72                     −0.61                    Y
2010–2011    CDC peak   2011    4     4.55
             CDC peak   2011    6     4.55
             Mf peak    2011    6     5.84          4.55                     −1.29                    Y
             Ml peak    2011    6     5.73          4.55                      1.18                    Y
             GFT peak   2011    6     4.08          4.55                     −0.47                    Y
2011–2012    CDC peak   2012   10     2.39
             Mf peak    2012    7     2.68          2.24                     −0.44                    N
             Ml peak    2012    7     2.85          2.24                     −1.55                    N
             GFT peak   2011   52     2.86          1.74                      1.12                    N
2012–2013    CDC peak   2012   51     6.07
             Mf peak    2012   51     5.31          6.07                      0.76                    Y
             Ml peak    2012   52     5.40          4.65                     −1.55                    N
             GFT peak   2013    2    10.56          4.52                      6.04                    N

ILI: influenza-like illness; CDC: Centers for Disease Control and Prevention; Mf: full model; Ml: Lasso model; GFT: Google Flu Trends.
*Referent values are the CDC ILI values for the week of the estimated ILI peak for Mf, Ml, and GFT.

The Mf model accurately estimated the ILI activity peak in 3 of the 6 influenza seasons for which data are available (the 2009–2010, 2010–2011, and 2012–2013 seasons), and was within one week of an accurate estimate in another season (2007–2008). The Ml model accurately estimated the week of peak ILI activity in 2 of 6 seasons (2007–2008 and 2010–2011), and estimated 2 others within a week (2009–2010 and 2012–2013). In comparison, Google Flu Trends data accurately estimated the peak of seasonal ILI activity in 2 of 6 influenza seasons (the 2009–2010 and 2010–2011 seasons), and was accurate within one week in 2 other influenza seasons (2007–2008 and 2008–2009). It should be noted that in the 2010–2011 season, the CDC data peaked at the same ILI percentage in both week 4 and week 6 of 2011; week 6 was taken to be the true peak, as it agreed with both Wikipedia models and the GFT data. In the 2011–2012 season, the Mf and Ml models were 3 weeks early in their estimation of peak ILI activity and the GFT data were 10 weeks early. Finally, in the 2012–2013 influenza season, the GFT model was 3 weeks late and over-estimated the severity by more than 2.3-fold.
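The peak comparison summarized in Table 2 amounts to locating, within each influenza season, the week of maximum activity in the CDC series and in a model's estimates, then reporting the offset in weeks and the difference in peak values. A rough Python sketch follows; the season boundary (October through September) and the column names are illustrative assumptions, not taken from the original analysis.

```python
# Rough sketch of the per-season peak comparison behind Table 2.
# "weekly_estimates.csv", "cdc_ili", and "model_ili" are hypothetical names.
import pandas as pd

weekly = pd.read_csv("weekly_estimates.csv", parse_dates=["week"])

# Label each week with a season running October through the following September
# (an illustrative convention, not the paper's exact definition).
year = weekly["week"].dt.year
season_start = year.where(weekly["week"].dt.month >= 10, year - 1)
weekly["season"] = season_start.astype(str) + "-" + (season_start + 1).astype(str)

for season, grp in weekly.groupby("season"):
    cdc_peak = grp.loc[grp["cdc_ili"].idxmax()]      # observed peak week
    model_peak = grp.loc[grp["model_ili"].idxmax()]  # estimated peak week
    offset_weeks = (model_peak["week"] - cdc_peak["week"]).days // 7
    value_diff = round(model_peak["model_ili"] - cdc_peak["cdc_ili"], 2)
    print(season, "offset (weeks):", offset_weeks, "value diff:", value_diff)
```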
Discussion
Weekly ILI values based on Wikipedia article view counts were able to estimate US ILI activity within a reasonable range of error, with CDC data as the gold standard. While the CDC ILI data are routinely used as a gold standard, and are most often the best available source of ILI information for the country, this data source has potential biases of its own. There are over 2,900 outpatient healthcare providers registered as participants in the CDC's ILI surveillance program, but in any given week only approximately 1,800 provide ILI surveillance data [27]. In addition, the population size and density of the area served by each outpatient healthcare provider is not uniform across locations and may skew reporting. Increased media coverage of influenza may also prompt healthcare providers to submit more samples for analysis or to report more potential ILI cases than they otherwise would.

Several models were fit to estimate ILI activity, including a model containing all 32 health-related Wikipedia articles investigated and a Lasso regression model that selected 24 of these health-related Wikipedia articles as significant; each of these models was also run without the high media-awareness time periods representing the beginning of the H1N1 pandemic in the spring of 2009 and the higher-than-normal ILI rates of the 2012–2013 influenza season. These models were compared to official CDC ILI values as well as to GFT data.

Comparing the Mf and Ml models, the AIC value was slightly smaller for the Ml model, as was its range of estimation residuals. With a highly non-significant likelihood-ratio test between the two models, there is no evidence to suggest that the Mf model performs better than the Ml model, which may therefore be preferred here. However, since there is essentially no cost associated with collecting the additional variables, the full model may warrant continued use to allow for the possibility that more health-related Wikipedia articles become useful for ILI estimation.

Mf and Ml models that did not include data from the 2009 spring pH1N1 period or the 2012–2013 peak season resulted in slightly smaller AIC and residual values than their full-data counterparts, but did not show large enough improvements in estimates to suggest that higher than normal Wikipedia page view traffic or ILI activity were major factors in the models' ability to estimate ILI activity. This result exemplifies the Wikipedia models' ability to perform well in the face of increased media attention and higher than normal levels of ILI activity, whereas GFT has been shown on several occasions to be highly susceptible to these types of perturbations.

In comparison to GFT data, there are some areas where the Wikipedia models were superior and others where they were not. The full Wikipedia models were able to estimate the week of peak activity within a season more often than the GFT data. Out of the 6 seasons for which data were available, GFT estimated a peak ILI value that was more accurate (regardless of whether the peak timing was correct) than the Mf or Ml models in 4 seasons, while the Wikipedia models were more accurate in the remaining 2. These analyses and comparisons were carried out on GFT data that had been retrospectively adjusted by Google after large discrepancies between its estimates and CDC ILI data were found following the 2012–2013 influenza season, which was more severe than normal.
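Because the Lasso-selected predictors in Ml are a subset of those in Mf, the two Poisson GLMs are nested and can be compared with a likelihood-ratio test, as mentioned above. The sketch below shows the mechanics in Python; the data file, the column names, and the stand-in "reduced" subset are all hypothetical placeholders, not the models actually reported.

```python
# Minimal sketch of a likelihood-ratio test between nested Poisson GLMs.
import pandas as pd
import statsmodels.api as sm
from scipy.stats import chi2

weekly = pd.read_csv("weekly_ili_and_views.csv")                 # hypothetical file
mf_cols = [c for c in weekly.columns if c.endswith("_views")]    # full predictor set
ml_cols = mf_cols[:10]                                           # stand-in for the Lasso-selected subset

def fit_poisson(y, X):
    return sm.GLM(y, sm.add_constant(X), family=sm.families.Poisson()).fit()

full = fit_poisson(weekly["ili"], weekly[mf_cols])     # analogous to Mf
reduced = fit_poisson(weekly["ili"], weekly[ml_cols])  # analogous to Ml (nested in Mf)

# Twice the log-likelihood difference is approximately chi-squared under the
# reduced model, with df equal to the difference in estimated parameters.
lr_stat = 2 * (full.llf - reduced.llf)
df_diff = int(full.df_model - reduced.df_model)
p_value = chi2.sf(lr_stat, df_diff)
print(f"LR statistic = {lr_stat:.2f}, df = {df_diff}, p = {p_value:.3f}")
```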
Even with this retrospective adjustment of the GFT model parameters, the peak value estimated by GFT for the 2012–2013 season is exaggerated by more than 2.3-fold (a difference of 6.04%) compared with CDC data, and its timing was estimated 3 weeks later than the peak actually occurred. For the same period, the Mf model accurately estimated the timing of the peak, and its estimate was within 0.76% of the CDC value.

This study is unique in that it is, to the authors' knowledge, the first scientific investigation into harnessing Wikipedia usage data over time to estimate the burden of disease in a population. While Google keeps the GFT model parameters confidential, the Wikipedia article utilization data in these analyses are freely available and open to modification and improvement by anyone. Although it has not been investigated here, there is potential for this method to be adapted for monitoring other health-related issues such as heart disease, diabetes, and sexually transmitted infections. While these conditions do not have the same time-varying component as influenza, the overall burden of disease may potentially be estimated from the number of people visiting Wikipedia articles of interest. This is an open method that can be further developed by researchers to investigate the relationship between Wikipedia article views and many factors of interest to public health.

Data on Wikipedia page views are updated and made available each hour, although the data in this study were aggregated to the day level and then further aggregated to the week level, so that one week of Wikipedia data matched one week of the CDC's ILI estimate. In practice, if this Wikipedia-based ILI surveillance system were implemented on a more permanent basis, updates to the Wikipedia-estimated proportion of ILI activity in the United States could be available on a daily or even hourly basis, although this application has not yet been explored. It is hypothesized that hourly updates may have trouble dealing with periods of low viewing activity, such as nighttime and normal sleeping hours, and that the benefit of an hourly update over a daily one might not be worth the effort required to maintain it. Daily estimates are likely to be of greater use than hourly ones and hold potential as a tool for detecting outbreaks in real time, by creating an alert when the daily number of Wikipedia article views spikes over a set threshold.

As with any study using non-traditional sources of information to make estimates or predictions, there is always some measure of noise in the gathered information. For instance, the Wikipedia article view counts used in this study represent all views of the English-language Wikipedia website. While the largest proportion of these article views comes from the United States (41%, with the next largest source being the United Kingdom at 11%), the remaining 59% of views come from other countries where English is used, such as the United Kingdom, Australia, Canada, and India. Because Wikipedia does not make the location of each article visitor readily available, the relationship between article views and ILI activity in the United States is less reliable than it would be if the article view data came from the United States alone. To investigate this bias, it may be of interest to replicate this study using data that are country- and language-specific.
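The threshold-based alerting idea mentioned above could be prototyped in a few lines: flag any day on which an article's view count rises well above its recent baseline. In the Python sketch below, the 28-day window and the three-standard-deviation threshold are arbitrary illustrative choices, and the file and column names are placeholders.

```python
# Toy sketch of a daily spike alert on Wikipedia article views.
import pandas as pd

daily = pd.read_csv("wikipedia_daily_views.csv",
                    parse_dates=["date"], index_col="date").sort_index()
views = daily["influenza_views"]          # hypothetical column name

# Rolling baseline over the previous 28 days; shift(1) keeps each day out of
# its own baseline so a spike cannot mask itself.
baseline = views.rolling(window=28, min_periods=14)
zscore = (views - baseline.mean().shift(1)) / baseline.std().shift(1)

alerts = zscore[zscore > 3]               # days more than 3 SD above baseline
print(alerts)
```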
For instance, one could obtain article view information for articles that exist only on the Italian-language Wikipedia website and compare those data with Italian ILI activity data. Alternatively, the timing and intensity of influenza seasons in English-Wikipedia-using countries other than the United States could be investigated as potential explanations of model performance. Depending on the timing of influenza activity in other countries, their residents' Wikipedia usage could potentially bolster the presented Wikipedia-based model estimates (if their influenza seasons are similar to that of the United States) or negatively affect them (if their influenza seasons are not). This is an interesting avenue of comparison and may be explored in future iterations of this work. If these models continue to estimate real-time ILI activity accurately, there is potential for this method to be used to predict the timing and intensity of ILI activity in upcoming weeks. While re-purposing these models could be a significant undertaking, we are interested in pursuing this avenue of investigation in future work.

There has been much discussion in the popular media recently about the potential future directions of Wikipedia. It has been noted in several papers and reviews that the number of active Wikipedia editors has been slowly decreasing over the past 6 years, from its peak of more than 51,000 in 2007 to approximately 31,000 in the summer of 2013 [19], [28]. It has been speculated that the efforts made by the Wikimedia Foundation and its core group of dedicated volunteers to create a more reliable, trustworthy corpus of information have limited the ability of new editors to edit or create articles, thereby decreasing the likelihood that a new contributor will become a trusted source of information. Compounding this decrease in active editors, it has become increasingly evident that the vast majority of articles on the English Wikipedia website skew toward male-oriented and Western and European-centric topics, with comparatively few articles dealing with female-oriented topics or other geographic regions. Despite these concerns, the influenza-related articles investigated in this study are of the type that is routinely and adequately maintained by long-time editors. The authors hypothesize that any decrease in the number of editors in the Wikimedia domain is unlikely to create significant changes in viewership of the articles of interest for estimating or predicting influenza-like illness, and therefore should not meaningfully hinder the pursuit of this type of surveillance.

Due to an error in Wikipedia data collection, no article view data were available between July 13 and July 31, 2008, inclusive, resulting in a time gap of just over 2.5 weeks. Fortunately, this gap occurred during a traditionally low-prevalence time of year for ILI and is not suspected to have meaningfully affected the analyses.

The application of Wikipedia article view data has been demonstrated to be effective at estimating the level of ILI activity in the US when compared with CDC data. Wikipedia article view data are available daily (and hourly, if necessary) and can provide a reliable estimate of ILI activity up to 2 weeks in advance of traditional ILI reporting.
This study exemplifies how non-traditional data sources may be tapped to provide valuable public health insights and, with further improvement and validation, could potentially be implemented as an automatic sentinel surveillance system for any number of diseases or conditions of interest, as a supplement to more traditional surveillance systems.

Author and article information
Journal: PeerJ (PeerJ Inc., San Francisco, USA); ISSN 2167-8359
Published: 25 June 2018; 6: e5134
Affiliations:
[1] Department of Epidemiology, School of Public Health, China Medical University, Shenyang, Liaoning, China
[2] Department of Mathematics, School of Fundamental Sciences, China Medical University, Shenyang, Liaoning, China
Author ORCIDs: http://orcid.org/0000-0003-0190-7301; http://orcid.org/0000-0001-7648-3625
Article 5134; DOI: 10.7717/peerj.5134; PMC: 6022725; PMID: 29967755
© 2018 Liang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose, provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
History: received 5 April 2018; accepted 8 June 2018
Funding: This work was supported by the National Natural Science Foundation of China (No. 71573275). There was no additional external funding received for this study. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Categories: Health Policy; Infectious Diseases
Keywords: SVM regression model; seasonal influenza; Baidu search query; Liaoning; flu surveillance system
