
      Utility Metrics for Evaluating Synthetic Health Data Generation Methods: Validation Study

      research-article


          Abstract

          Background

          A regular task for developers and users of synthetic data generation (SDG) methods is to evaluate and compare the utility of these methods. Multiple utility metrics have been proposed and used to evaluate synthetic data. However, these metrics have not been validated, either in general or for comparing SDG methods.

          Objective

          This study evaluates the ability of common utility metrics to rank SDG methods according to performance on a specific analytic workload. The workload of interest is the use of synthetic data for logistic regression prediction models, which is a very frequent workload in health research.

          Methods

          We evaluated 6 utility metrics on 30 different health data sets and 3 different SDG methods (a Bayesian network, a generative adversarial network, and sequential tree synthesis). Each metric was computed by averaging across 20 synthetic data sets from the same generative model. The metrics were then tested on their ability to rank the SDG methods based on prediction performance. Prediction performance was defined as the difference, for both the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC), between logistic regression prediction models built on synthetic data and models built on real data.
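The prediction performance measure described above can be illustrated with a minimal sketch. It assumes that both logistic regression models (one built on real data, one on synthetic data) are scored on the same held-out real outcomes, and that the difference is taken as real minus synthetic; the abstract does not state either detail. The function names `auroc`, `auprc`, and `performance_gap` are hypothetical.

```python
def auroc(y_true, scores):
    """AUROC as the probability that a random positive outranks a
    random negative, with ties counted as one half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auprc(y_true, scores):
    """Average precision: the mean of the precision values observed
    at the rank of each positive case (one common AUPRC estimate)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits


def performance_gap(y_true, real_model_scores, synth_model_scores):
    """Difference in AUROC and AUPRC between a model built on real data
    and a model built on synthetic data (assumed sign: real - synthetic)."""
    return {
        "auroc_diff": auroc(y_true, real_model_scores)
        - auroc(y_true, synth_model_scores),
        "auprc_diff": auprc(y_true, real_model_scores)
        - auprc(y_true, synth_model_scores),
    }
```

A gap near zero would indicate that the model built on synthetic data predicts about as well as the model built on real data.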

          Results

          The utility metric best able to rank SDG methods was the multivariate Hellinger distance based on a Gaussian copula representation of real and synthetic joint distributions.
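To make the idea of a copula-based Hellinger distance concrete, here is a minimal sketch, not the authors' implementation. It assumes continuous variables with few ties, maps each variable to normal scores through its ranks, estimates a correlation matrix for the real and for the synthetic data, and applies the closed-form Hellinger distance between two zero-mean Gaussians. All function names are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np
from statistics import NormalDist


def normal_scores(x):
    """Map a 1-D array to standard normal scores through its ranks.
    Ties are broken by order of appearance (a simplification)."""
    x = np.asarray(x, dtype=float)
    ranks = np.argsort(np.argsort(x)) + 1      # ranks 1..n
    u = ranks / (len(x) + 1)                   # strictly inside (0, 1)
    inv_cdf = NormalDist().inv_cdf
    return np.array([inv_cdf(p) for p in u])


def copula_correlation(data):
    """Correlation matrix of the normal scores of each column."""
    data = np.asarray(data, dtype=float)
    z = np.column_stack(
        [normal_scores(data[:, j]) for j in range(data.shape[1])]
    )
    return np.corrcoef(z, rowvar=False)


def gaussian_hellinger(corr_p, corr_q):
    """Hellinger distance between N(0, corr_p) and N(0, corr_q).
    Assumes both matrices are positive definite."""
    avg = (corr_p + corr_q) / 2.0
    _, logdet_p = np.linalg.slogdet(corr_p)
    _, logdet_q = np.linalg.slogdet(corr_q)
    _, logdet_avg = np.linalg.slogdet(avg)
    # Bhattacharyya coefficient for equal (zero) means.
    bc = np.exp(0.25 * logdet_p + 0.25 * logdet_q - 0.5 * logdet_avg)
    return float(np.sqrt(max(0.0, 1.0 - bc)))


def copula_hellinger(real, synth):
    """Hellinger distance between Gaussian copula fits of two data sets."""
    return gaussian_hellinger(
        copula_correlation(real), copula_correlation(synth)
    )
```

The result lies between 0 (identical copula correlation structure) and 1, so lower values indicate synthetic data whose joint dependence structure is closer to that of the real data.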

          Conclusions

          This study has validated a generative model utility metric, the multivariate Hellinger distance, which can be used to reliably rank competing SDG methods on the same data set. The Hellinger distance metric can therefore be used to evaluate and compare alternative SDG methods.
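As a rough illustration of ranking competing SDG methods on one data set, the sketch below averages a distance metric over replicate synthetic data sets per method and sorts the methods from lowest to highest average. The method names and distance values are made-up placeholders, not results from the study, and `rank_methods` is a hypothetical helper.

```python
from statistics import mean


def rank_methods(metric_by_method):
    """Average the replicate metric values for each method, then sort
    methods ascending (a lower distance is treated as better utility)."""
    averaged = {name: mean(vals) for name, vals in metric_by_method.items()}
    return sorted(averaged, key=averaged.get)


# Placeholder distances for three hypothetical methods, with three
# replicate synthetic data sets each (the study used 20 replicates).
example = {
    "method_a": [0.12, 0.10, 0.11],
    "method_b": [0.30, 0.28, 0.33],
    "method_c": [0.08, 0.09, 0.07],
}
ranking = rank_methods(example)
```

Here `ranking` lists the method with the smallest average distance first.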


                Author and article information

                Contributors
                Journal
                JMIR Medical Informatics (JMIR Med Inform)
                JMIR Publications (Toronto, Canada)
                ISSN: 2291-9694
                7 April 2022
                Volume 10, Issue 4, e35734
                Affiliations
                [1] School of Epidemiology and Public Health, University of Ottawa, Ottawa, ON, Canada
                [2] Children's Hospital of Eastern Ontario Research Institute, Ottawa, ON, Canada
                [3] Replica Analytics Ltd, Ottawa, ON, Canada
                [4] Open Source Research Collaboration, Aalborg, Denmark
                Author notes
                Corresponding Author: Khaled El Emam, kelemam@ehealthinformation.ca
                Author information
                https://orcid.org/0000-0003-3325-4149
                https://orcid.org/0000-0002-5289-8372
                https://orcid.org/0000-0002-5571-7004
                https://orcid.org/0000-0002-0070-8362
                Article
                v10i4e35734
                DOI: 10.2196/35734
                PMCID: PMC9030990
                PMID: 35389366
                3382be81-dcf0-4f21-8f5d-258de0759c3b
                ©Khaled El Emam, Lucy Mosquera, Xi Fang, Alaa El-Hussuna. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 07.04.2022.

                This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.

                History
                : 15 December 2021
                : 4 January 2022
                : 27 January 2022
                : 13 February 2022
                Categories
                Original Paper

                synthetic data, data utility, data privacy, generative models, utility metric, synthetic data generation, logistic regression, model validation, medical informatics, binary prediction model, prediction model
