      Total Error in a Big Data World: Adapting the TSE Framework to Big Data

      1 , 2 , 3
      Journal of Survey Statistics and Methodology
      Oxford University Press (OUP)


          Abstract

          While Big Data offers a potentially less expensive, less burdensome, and more timely alternative to survey data for producing a variety of statistics, it is not without error. The AAPOR Task Force on Big Data and others have called for researchers to evaluate the quality of Big Data using an approach similar to the total survey error (TSE) framework. However, differences in the construction of, access to, and overall data structure between survey data and Big Data make application of TSE difficult. In this article, we seek to develop the Total Error Framework (TEF), an extension of the TSE framework, to be (1) more inclusive and applicable to many types of Big Data, (2) comprehensive in that it considers “total” error, and (3) unified in that it allows researchers to compare errors in Big Data to errors in survey data. After outlining this framework, we then illustrate an application of TEF by comparing error in housing unit area (square footage) estimates collected in a survey (the 2015 Residential Energy Consumption Survey [RECS]) to those estimates found in three Big Data databases (Zillow.com, Acxiom, and CoreLogic).

          Most cited references (20)

          Detecting influenza epidemics using search engine query data.

          Seasonal influenza epidemics are a major public health concern, causing tens of millions of respiratory illnesses and 250,000 to 500,000 deaths worldwide each year. In addition to seasonal influenza, a new strain of influenza virus against which no previous immunity exists and that demonstrates human-to-human transmission could result in a pandemic with millions of fatalities. Early detection of disease activity, when followed by a rapid response, can reduce the impact of both seasonal and pandemic influenza. One way to improve early detection is to monitor health-seeking behaviour in the form of queries to online search engines, which are submitted by millions of users around the world each day. Here we present a method of analysing large numbers of Google search queries to track influenza-like illness in a population. Because the relative frequency of certain queries is highly correlated with the percentage of physician visits in which a patient presents with influenza-like symptoms, we can accurately estimate the current level of weekly influenza activity in each region of the United States, with a reporting lag of about one day. This approach may make it possible to use search queries to detect influenza epidemics in areas with a large population of web search users.
            When Google got flu wrong.

              Is Open Access

              Assessing Google Flu Trends Performance in the United States during the 2009 Influenza Virus A (H1N1) Pandemic

              Background: Google Flu Trends (GFT) uses anonymized, aggregated internet search activity to provide near-real time estimates of influenza activity. GFT estimates have shown a strong correlation with official influenza surveillance data. The 2009 influenza virus A (H1N1) pandemic [pH1N1] provided the first opportunity to evaluate GFT during a non-seasonal influenza outbreak. In September 2009, an updated United States GFT model was developed using data from the beginning of pH1N1.

              Methodology/Principal Findings: We evaluated the accuracy of each U.S. GFT model by comparing weekly estimates of ILI (influenza-like illness) activity with the U.S. Outpatient Influenza-like Illness Surveillance Network (ILINet). For each GFT model we calculated the correlation and RMSE (root mean square error) between model estimates and ILINet for four time periods: pre-H1N1, Summer H1N1, Winter H1N1, and H1N1 overall (Mar 2009–Dec 2009). We also compared the number of queries, query volume, and types of queries (e.g., influenza symptoms, influenza complications) in each model. Both models' estimates were highly correlated with ILINet pre-H1N1 and over the entire surveillance period, although the original model underestimated the magnitude of ILI activity during pH1N1. The updated model was more correlated with ILINet than the original model during Summer H1N1 (r = 0.95 and 0.29, respectively). The updated model included more search query terms than the original model, with more queries directly related to influenza infection, whereas the original model contained more queries related to influenza complications.

              Conclusions: Internet search behavior changed during pH1N1, particularly in the categories “influenza complications” and “term for influenza.” The complications associated with pH1N1, the fact that pH1N1 began in the summer rather than winter, and changes in health-seeking behavior each may have played a part. Both GFT models performed well prior to and during pH1N1, although the updated model performed better during pH1N1, especially during the summer months.
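The two accuracy measures named in this abstract, correlation and RMSE, capture different things, which may be why both are reported. The following is a minimal Python sketch of how they could be computed for a pair of weekly series. The function names `pearson_r` and `rmse` and the four data points are hypothetical and chosen for illustration only; they are not taken from the study, and the study's own computation may differ.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def rmse(estimates, reference):
    """Root mean square error of estimates against a reference series."""
    if len(estimates) != len(reference) or not estimates:
        raise ValueError("series must be non-empty and of equal length")
    squared = [(e - r) ** 2 for e, r in zip(estimates, reference)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up weekly ILI percentages (illustration only, not study data).
reference = [1.0, 2.0, 3.0, 4.0]   # stands in for a surveillance series
estimates = [0.5, 1.5, 2.5, 3.5]   # a model that is always 0.5 points low

print(pearson_r(estimates, reference))  # 1.0
print(rmse(estimates, reference))       # 0.5
```

In this toy case the estimates track the reference perfectly in shape (correlation of 1.0) while still understating its level (RMSE of 0.5), which is the kind of pattern the abstract describes for a model that is highly correlated with ILINet yet underestimates the magnitude of activity.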

                Author and article information

                Journal
                Journal of Survey Statistics and Methodology
                Oxford University Press (OUP)
                2325-0984
                2325-0992
                February 2020
                February 01 2020
                January 27 2020
                Volume: 8
                Issue: 1
                Pages: 89-119
                Affiliations
                [1] Senior Research Survey Methodologist in the Program for Research in Survey Methodology, RTI International, 701 13th St NW #750, Washington, DC 20005, USA
                [2] Distinguished Fellow, RTI International and Associate Director, The Odum Institute for Research in Social Science at the University of North Carolina at Chapel Hill, 3040 East Cornwallis Road, P.O. Box 12194, Research Triangle Park, NC 27709-2194, USA
                [3] Team Leader, Office of Statistical Methods and Research, U.S. Energy Information Administration, 1000 Independence Ave. SW, Washington, DC 20585, USA
                Article
                10.1093/jssam/smz056
                © 2020

                https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model
