      Using Machine Learning to Optimize the Quality of Survey Data: Protocol for a Use Case in India

      Neha Shah, MSPH 1; Diwakar Mohan, MBBS, MD, MPH, DrPH 1; Jean Juste Harisson Bashingwa, PhD 2; Osama Ummer, BAMS, MPH 3; Arpita Chakraborty, MPhil, MA 3; Amnesty E. LeFevre, MHS, PhD 1, 4
      JMIR Research Protocols
      JMIR Publications
      quality assurance, household survey data, machine learning, monitoring, real-time data, data analytics




          Data quality is vital for ensuring the accuracy, reliability, and validity of survey findings. Strategies for ensuring survey data quality have traditionally relied on quality assurance procedures. Data analytics is an increasingly vital part of survey quality assurance, particularly given the growing use of tablets and other electronic tools, which enable rapid, if not real-time, data access. Routine data analytics are most often concerned with outlier analyses that monitor a series of data quality indicators, including response rates, missing data, and reliability coefficients for test-retest interviews. Machine learning is emerging as a possible tool for enhancing real-time data monitoring by identifying trends in data collection that could compromise quality.
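As an illustration of the routine outlier analyses mentioned above, the following minimal sketch flags enumerators whose missing-data rate lies unusually far from the group median. It is not taken from the protocol: the robust z-score (median and median absolute deviation), the 3.5 cutoff, the function names, and the enumerator IDs and rates are all assumptions chosen for the example.

```python
from statistics import median


def robust_z_scores(values):
    """Return median/MAD-based z-scores for a list of numbers."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread around the median: nothing can be called an outlier.
        return [0.0 for _ in values]
    # 1.4826 rescales the MAD to match the standard deviation under normality.
    return [(v - med) / (1.4826 * mad) for v in values]


def flag_outlier_enumerators(missing_rates, threshold=3.5):
    """Return IDs whose missing-data rate is far from the median (assumed cutoff)."""
    ids = list(missing_rates)
    scores = robust_z_scores([missing_rates[i] for i in ids])
    return [i for i, z in zip(ids, scores) if abs(z) > threshold]


# Hypothetical share of survey items left missing, per enumerator.
rates = {"E01": 0.02, "E02": 0.03, "E03": 0.02, "E04": 0.04, "E05": 0.21}
flagged = flag_outlier_enumerators(rates)
print(flagged)
```

The same pattern could be applied to other indicators named in the text, such as response rates, by swapping in a different per-enumerator dictionary.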


          This study aimed to describe methods for the quality assessment of a household survey using both traditional methods and machine learning analytics.


          In the Kilkari impact evaluation’s end-line survey among postpartum women (n=5095) in Madhya Pradesh, India, we plan to use both traditional and machine learning–based quality assurance procedures to improve the quality of survey data captured on maternal and child health knowledge, care-seeking, and practices. The quality assurance strategy aims to identify biases and other impediments to data quality and includes seven main components: (1) tool development, (2) enumerator recruitment and training, (3) field coordination, (4) field monitoring, (5) data analytics, (6) feedback loops for decision making, and (7) outcomes assessment. Analyses will include basic descriptive and outlier analyses using machine learning algorithms, which will involve creating features from time stamps, “don’t know” rates, and skip rates. We will also obtain labeled data from self-filled surveys and build models using k-fold cross-validation on a training data set with both supervised and unsupervised learning algorithms. Based on these models, results will be fed back to the field through various feedback loops.
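The feature creation and k-fold cross-validation steps described above can be sketched roughly as follows. This is a hedged illustration, not the study's pipeline: the record layout, the "DK" and None answer codes, the label convention (1 = self-filled survey, 0 = field interview), the synthetic data, and the nearest-centroid classifier (a simple stand-in for whichever supervised algorithm is actually used) are all assumptions.

```python
import random
from datetime import datetime

DONT_KNOW = "DK"  # hypothetical code for a "don't know" answer
SKIPPED = None    # hypothetical marker for a skipped item


def interview_features(record):
    """Turn one interview into (duration_minutes, dont_know_rate, skip_rate)."""
    start = datetime.fromisoformat(record["start"])
    end = datetime.fromisoformat(record["end"])
    answers = record["answers"]
    duration = (end - start).total_seconds() / 60
    dk_rate = sum(a == DONT_KNOW for a in answers) / len(answers)
    skip_rate = sum(a is SKIPPED for a in answers) / len(answers)
    return (duration, dk_rate, skip_rate)


def k_fold_indices(n, k, seed=0):
    """Split range(n) into k shuffled folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def nearest_centroid_predict(train_X, train_y, x):
    """Predict the label whose class mean is closest to x (squared distance)."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, y in zip(train_X, train_y) if y == label]
        centre = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centre))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def cross_validate(X, y, k=5, seed=0):
    """Return mean held-out accuracy over k folds (assumes len(X) >= k)."""
    accuracies = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train_X = [X[i] for i in range(len(X)) if i not in held]
        train_y = [y[i] for i in range(len(X)) if i not in held]
        correct = sum(
            nearest_centroid_predict(train_X, train_y, X[i]) == y[i]
            for i in held_out
        )
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k


# One hypothetical interview record and its features.
example = {
    "start": "2019-11-01T10:00:00",
    "end": "2019-11-01T10:45:00",
    "answers": ["yes", "DK", None, "no"],
}
features = interview_features(example)

# Synthetic, well-separated feature rows: 0 = field interview, 1 = self-filled.
X = [(40 + i, 0.05, 0.10) for i in range(10)] + [(8 + 0.5 * i, 0.0, 0.30) for i in range(10)]
y = [0] * 10 + [1] * 10
accuracy = cross_validate(X, y, k=5)
print(features, accuracy)
```

In this sketch the features are left unscaled, so interview duration dominates the distance calculation; a real analysis would likely standardize features and compare several algorithms before feeding results back to the field.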


          Data collection began in late October 2019 and will continue through March 2020. We expect to submit quality assurance results by August 2020.


          Machine learning is underutilized as a tool to improve survey data quality in low-resource settings. Study findings are anticipated to improve the overall quality of Kilkari survey data and, in turn, enhance the robustness of the impact evaluation. More broadly, the proposed quality assurance approach has implications for data capture applications used in special surveys as well as in the routine collection of health information by health workers.

          International Registered Report Identifier (IRRID)



                Author and article information

                JMIR Research Protocols (JMIR Res Protoc)
                JMIR Publications (Toronto, Canada)
                5 August 2020
                Volume 9, Issue 8, e17619
                [1] Department of International Health, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, United States
                [2] Faculty of Health Sciences, Department of Integrative Biomedical Sciences, and Member of the Institute of Infectious Disease and Molecular Medicine, University of Cape Town, Cape Town, South Africa
                [3] Oxford Policy Management, New Delhi, India
                [4] Division of Epidemiology and Biostatistics, School of Public Health and Family Medicine, University of Cape Town, Cape Town, South Africa
                Author notes
                Corresponding Author: Neha Shah, nshah67@jh.edu
                Author information
                ©Neha Shah, Diwakar Mohan, Jean Juste Harisson Bashingwa, Osama Ummer, Arpita Chakraborty, Amnesty E. LeFevre. Originally published in JMIR Research Protocols (http://www.researchprotocols.org), 05.08.2020.

                This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Research Protocols, is properly cited. The complete bibliographic information, a link to the original publication on http://www.researchprotocols.org, as well as this copyright and license information must be included.

                : 29 December 2019
                : 13 March 2020
                : 18 March 2020
                : 13 June 2020
