Utility Metrics for Evaluating Synthetic Health Data Generation Methods: Validation Study

Abstract

Background

A common task for developers and users of synthetic data generation (SDG) methods is to evaluate and compare the utility of these methods. Multiple utility metrics have been proposed and used to evaluate synthetic data. However, they have not been validated, either in general or specifically for comparing SDG methods.

Objective

This study evaluates the ability of common utility metrics to rank SDG methods according to performance on a specific analytic workload. The workload of interest is the use of synthetic data for logistic regression prediction models, which is very common in health research.

Methods

We evaluated 6 utility metrics on 30 different health data sets and 3 different SDG methods (a Bayesian network, a generative adversarial network, and sequential tree synthesis). Each metric was computed by averaging across 20 synthetic data sets generated from the same generative model. The metrics were then tested on their ability to rank the SDG methods based on prediction performance. Prediction performance was defined as the difference in the area under the receiver operating characteristic curve (AUROC), and separately in the area under the precision-recall curve (AUPRC), between logistic regression prediction models built on synthetic data and those built on real data.
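The evaluation logic described in the Methods can be sketched roughly as follows. This is a minimal illustration, not the authors' code: all numbers are made up, the function names `mean_auc_difference` and `rank_methods` are hypothetical, the sign convention (synthetic minus real) is an assumption, and only 3 synthetic data sets per method are shown instead of 20.

```python
from statistics import mean


def mean_auc_difference(real_auc, synthetic_aucs):
    """Average (synthetic - real) AUC difference over repeated synthetic data sets."""
    return mean(s - real_auc for s in synthetic_aucs)


def rank_methods(scores, higher_is_better):
    """Return method names ordered from best to worst."""
    return sorted(scores, key=scores.get, reverse=higher_is_better)


# Hypothetical values for illustration only (NOT results from the study).
real_auroc = 0.80
synthetic_aurocs = {
    "bayesian_network": [0.78, 0.79, 0.77],
    "gan": [0.70, 0.72, 0.71],
    "sequential_trees": [0.79, 0.80, 0.78],
}
# A hypothetical distance-type utility metric (lower = closer to the real data),
# computed once per synthetic data set and then averaged per SDG method.
utility_metric = {
    "bayesian_network": [0.12, 0.10, 0.11],
    "gan": [0.30, 0.28, 0.29],
    "sequential_trees": [0.08, 0.09, 0.07],
}

auc_diff = {m: mean_auc_difference(real_auroc, v) for m, v in synthetic_aurocs.items()}
mean_metric = {m: mean(v) for m, v in utility_metric.items()}

# A useful utility metric should order the SDG methods the same way
# as the prediction performance difference does.
ranking_by_prediction = rank_methods(auc_diff, higher_is_better=True)
ranking_by_metric = rank_methods(mean_metric, higher_is_better=False)
print(ranking_by_prediction)
print(ranking_by_metric)
```

In this toy setup the two rankings agree, which is the property the study tests for each candidate metric; the same comparison would be repeated for AUPRC.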

Results

The utility metric best able to rank SDG methods was the multivariate Hellinger distance based on a Gaussian copula representation of real and synthetic joint distributions.
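One way to read "multivariate Hellinger distance based on a Gaussian copula" is sketched below; it is an assumption-laden illustration, not the authors' implementation. It assumes numeric columns without ties, maps each column to normal scores through its ranks, and then applies the standard closed-form Hellinger distance between two zero-mean multivariate Gaussians with correlation matrices R and S: H^2 = 1 - det(R)^(1/4) det(S)^(1/4) / det((R + S)/2)^(1/2). All function names and the toy data are hypothetical.

```python
import math
from statistics import NormalDist


def normal_scores(column):
    """Map a numeric column to standard-normal scores via its ranks (ties not handled)."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = NormalDist().inv_cdf(rank / (n + 1))
    return scores


def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equal-length columns."""
    n = len(columns[0])
    centered = [[x - sum(col) / n for x in col] for col in columns]
    norms = [math.sqrt(sum(x * x for x in c)) for c in centered]
    d = len(columns)
    return [
        [sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
         for j in range(d)]
        for i in range(d)
    ]


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for k in range(n):
        pivot = max(range(k, n), key=lambda r: abs(a[r][k]))
        if a[pivot][k] == 0.0:
            return 0.0
        if pivot != k:
            a[k], a[pivot] = a[pivot], a[k]
            det = -det
        det *= a[k][k]
        for r in range(k + 1, n):
            factor = a[r][k] / a[k][k]
            for c in range(k, n):
                a[r][c] -= factor * a[k][c]
    return det


def copula_hellinger(real_columns, synthetic_columns):
    """Hellinger distance between Gaussian copula fits of two data sets (0 = identical)."""
    r = correlation_matrix([normal_scores(c) for c in real_columns])
    s = correlation_matrix([normal_scores(c) for c in synthetic_columns])
    d = len(r)
    avg = [[(r[i][j] + s[i][j]) / 2 for j in range(d)] for i in range(d)]
    # Bhattacharyya coefficient for two zero-mean Gaussians.
    bc = (determinant(r) * determinant(s)) ** 0.25 / math.sqrt(determinant(avg))
    return math.sqrt(max(0.0, 1.0 - bc))


# Toy data for illustration only: two variables, five records each.
real = [[1, 2, 3, 4, 5], [2, 1, 4, 3, 5]]
synthetic = [[1, 2, 3, 4, 5], [5, 3, 4, 1, 2]]
print(copula_hellinger(real, real))       # close to 0
print(copula_hellinger(real, synthetic))  # between 0 and 1
```

The distance is bounded between 0 and 1, which makes values comparable across SDG methods applied to the same data set. Real health data would also need handling of categorical variables and ties, which this sketch omits.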

Conclusions

This study validated a generative model utility metric, the multivariate Hellinger distance, which can be used to reliably rank competing SDG methods on the same data set and thus to evaluate and compare alternative SDG methods.

Author and article information

Contributors

Khaled El Emam:

ORCID: https://orcid.org/0000-0003-3325-4149

School of Epidemiology and Public Health, University of Ottawa, 401 Smyth Road, Ottawa, ON, K1H 8L1, Canada; 1 6137975412; kelemam@ehealthinformation.ca

Journal

Journal ID (nlm-ta): JMIR Med Inform

Journal ID (iso-abbrev): JMIR Med Inform

Journal ID (publisher-id): JMI

Title: JMIR Medical Informatics

Publisher: JMIR Publications (Toronto, Canada)

ISSN (Electronic): 2291-9694

Publication date (Collection): April 2022

Publication date (Electronic): 7 April 2022

Volume: 10

Issue: 4

Electronic Location Identifier: e35734

Affiliations

[1] School of Epidemiology and Public Health, University of Ottawa, Ottawa, ON, Canada

[2] Children's Hospital of Eastern Ontario Research Institute, Ottawa, ON, Canada

[3] Replica Analytics Ltd, Ottawa, ON, Canada

[4] Open Source Research Collaboration, Aalborg, Denmark

Author notes

Corresponding Author: Khaled El Emam, kelemam@ehealthinformation.ca

Author information

Khaled El Emam https://orcid.org/0000-0003-3325-4149

Lucy Mosquera https://orcid.org/0000-0002-5289-8372

Xi Fang https://orcid.org/0000-0002-5571-7004

Alaa El-Hussuna https://orcid.org/0000-0002-0070-8362

Article

Publisher ID: v10i4e35734

DOI: 10.2196/35734

PMC ID: 9030990

PubMed ID: 35389366

SO-VID: 3382be81-dcf0-4f21-8f5d-258de0759c3b

License:

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.

History

Date received: 15 December 2021

Date revision requested: 4 January 2022

Date revision received: 27 January 2022

Date accepted: 13 February 2022
