
      Validating a membership disclosure metric for synthetic health data

      research-article
      JAMIA Open
      Oxford University Press
      synthetic data generation, data privacy, membership disclosure


          Abstract

          Background

          One increasingly accepted method for evaluating the privacy of synthetic data is to measure the risk of membership disclosure. This risk is expressed as the F1 accuracy with which an adversary can correctly ascertain that a target individual from the same population as the real data is in the dataset used to train the generative model. It is commonly estimated using a data partitioning methodology with a partitioning parameter of 0.5.
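As a rough illustration of the metric described above, the following sketch computes the F1 score of an adversary's membership guesses. The function name `membership_f1` and the toy labels are hypothetical and are not taken from the article; a value of 1 marks a record that is (or is guessed to be) in the training dataset.

```python
from typing import Sequence


def membership_f1(is_member: Sequence[int], guessed_member: Sequence[int]) -> float:
    """F1 score of membership guesses (1 = in training data, 0 = not)."""
    if len(is_member) != len(guessed_member):
        raise ValueError("inputs must have the same length")
    pairs = list(zip(is_member, guessed_member))
    # True positives: members correctly flagged as members.
    tp = sum(1 for truth, guess in pairs if truth == 1 and guess == 1)
    # False positives: non-members wrongly flagged as members.
    fp = sum(1 for truth, guess in pairs if truth == 0 and guess == 1)
    # False negatives: members the adversary missed.
    fn = sum(1 for truth, guess in pairs if truth == 1 and guess == 0)
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy example: three members, three non-members.
truth = [1, 1, 1, 0, 0, 0]
guess = [1, 1, 0, 1, 0, 0]
print(membership_f1(truth, guess))
```

In this toy case the adversary has two true positives, one false positive, and one false negative, so precision and recall are both 2/3.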

          Objective

          Validate the membership disclosure F1 score, evaluate and improve the parameterization of the partitioning method, and provide a benchmark for its interpretation.

          Materials and methods

          We performed a simulated membership disclosure attack on 4 population datasets: an Ontario COVID-19 dataset, a state hospital discharge dataset, a national health survey, and an international COVID-19 behavioral survey. Two generative methods were evaluated: sequential synthesis and a generative adversarial network. A theoretical analysis and a simulation were used to determine the correct partitioning parameter that would give the same F1 score as a ground truth simulated membership disclosure attack.

          Results

          The default 0.5 parameter can give quite inaccurate membership disclosure values. The proportion of records from the training dataset in the attack dataset must be equal to the sampling fraction of the real dataset from the population. The approach is demonstrated on 7 clinical trial datasets.
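The requirement stated in the results, that the share of training records in the attack dataset should match the sampling fraction, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `build_attack_dataset`, `naive_f1`, the integer record IDs, and the 10% sampling fraction are all hypothetical. The `naive_f1` helper is included only to show how strongly an F1 value depends on that share: an adversary who labels every record a member has precision equal to the member share and recall of 1.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def build_attack_dataset(
    training: Sequence[T],
    holdout: Sequence[T],
    sampling_fraction: float,
    size: int,
    seed: int = 0,
) -> List[Tuple[T, int]]:
    """Return (record, is_member) pairs whose member share equals sampling_fraction."""
    if not 0.0 < sampling_fraction < 1.0:
        raise ValueError("sampling_fraction must be strictly between 0 and 1")
    n_members = round(size * sampling_fraction)
    n_non_members = size - n_members
    if n_members > len(training) or n_non_members > len(holdout):
        raise ValueError("not enough records to build the attack dataset")
    rng = random.Random(seed)
    # Members come from the training data, non-members from the rest of the population.
    members = rng.sample(list(training), n_members)
    non_members = rng.sample(list(holdout), n_non_members)
    attack = [(rec, 1) for rec in members] + [(rec, 0) for rec in non_members]
    rng.shuffle(attack)
    return attack


def naive_f1(member_share: float) -> float:
    """F1 of guessing 'member' for every record: precision = share, recall = 1."""
    return 2 * member_share / (1 + member_share)


# Hypothetical population of 10,000 record IDs, of which 1,000 were used for training.
training_ids = list(range(1000))
holdout_ids = list(range(1000, 10000))
attack = build_attack_dataset(training_ids, holdout_ids, sampling_fraction=0.1, size=500)
print(sum(label for _, label in attack) / len(attack))  # member share in the attack set
print(naive_f1(0.5), naive_f1(0.1))
```

With a member share of 0.5, guessing "member" for every record already yields an F1 of about 0.67, whereas at a share of 0.1 the same strategy yields about 0.18, which is one way to see why the choice of proportion changes the resulting score.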

          Conclusions

          Our proposed parameterization, together with the interpretation and generative model training guidance, provides a theoretically and empirically grounded basis for evaluating and managing membership disclosure risk for synthetic data.


                Author and article information

                Journal
                JAMIA Open (jamiaoa)
                Oxford University Press
                ISSN: 2574-2531
                December 2022
                11 October 2022
                Volume: 5
                Issue: 4
                Article ID: ooac083
                Affiliations
                Data Science, Replica Analytics Ltd., Ottawa, Ontario, Canada
                School of Epidemiology and Public Health, University of Ottawa, Ottawa, Ontario, Canada
                Research Institute, Children’s Hospital of Eastern Ontario, Ottawa, Ontario, Canada
                Author notes
                Corresponding Author: Khaled El Emam, PhD, Research Institute, Children’s Hospital of Eastern Ontario, 401 Smyth Road, Ottawa, Ontario K1H 8L1, Canada; kelemam@ehealthinformation.ca
                Author information
                https://orcid.org/0000-0003-3325-4149
                Article
                Article ID: ooac083
                DOI: 10.1093/jamiaopen/ooac083
                PMCID: PMC9553223
                PMID: 36238080
                © The Author(s) 2022. Published by Oxford University Press on behalf of the American Medical Informatics Association.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

                History
                : 20 July 2022
                : 13 September 2022
                : 16 September 2022
                : 22 September 2022
                Page count
                Pages: 12
                Funding
                Funded by: Canadian Institutes of Health Research
                Categories
                Research and Applications
                AcademicSubjects/SCI01530
                AcademicSubjects/MED00010
                AcademicSubjects/SCI01060
