      Development and Validation of a Machine Learning Model Using Administrative Health Data to Predict Onset of Type 2 Diabetes

          Key Points

          Question

          Can a machine learning model trained on routinely collected administrative health data be used to accurately predict the onset of type 2 diabetes at the population level?

          Findings

          In this decision analytical model study of 2.1 million residents in Ontario, Canada, a machine learning model was developed with high discrimination, population-level calibration, and calibration across population subgroups.

          Meaning

          Study results suggest that machine learning and administrative health data can be used to create population health planning tools that accurately discriminate between high- and low-risk groups to guide investments and targeted interventions for diabetes prevention.

          Abstract

          This decision analytical model study assesses whether a machine learning model trained on routinely collected administrative health data from a single-payer health system in Canada can be used to predict the onset of type 2 diabetes in the population.

          Importance

          Systems-level barriers to diabetes care could be improved with population health planning tools that accurately discriminate between high- and low-risk groups to guide investments and targeted interventions.

          Objective

          To develop and validate a population-level machine learning model for predicting type 2 diabetes 5 years before diabetes onset using administrative health data.

          Design, Setting, and Participants

          This decision analytical model study used linked administrative health data from the diverse, single-payer health system in Ontario, Canada, between January 1, 2006, and December 31, 2016. A gradient boosting decision tree model was trained on data from 1 657 395 patients, validated on 243 442 patients, and tested on 236 506 patients. Costs associated with each patient were estimated using a validated costing algorithm. Data were analyzed from January 1, 2006, to December 31, 2016.
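The three cohort sizes above sum to the 2 137 343 residents sampled at study start, which works out to roughly 77.5%, 11.4%, and 11.1% for training, validation, and testing. The sketch below checks that arithmetic and illustrates one way a patient-level partition could be made deterministic. The abstract does not say how patients were assigned, so the hash-based `assign_partition` helper is a hypothetical illustration, not the authors' method.

```python
import hashlib

# Cohort sizes as reported in the abstract.
SPLIT_SIZES = {"train": 1_657_395, "validation": 243_442, "test": 236_506}


def split_fractions(sizes):
    """Return each partition's share of the combined cohort."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


def assign_partition(patient_id, fractions):
    """Map a patient to exactly one partition (assumed scheme, for illustration).

    Hashing the patient identifier keeps every record of a patient in the
    same partition, so repeated instances of one person cannot leak
    between training and evaluation data.
    """
    digest = hashlib.sha256(str(patient_id).encode("utf-8")).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    cumulative = 0.0
    name = None
    for name, fraction in fractions.items():
        cumulative += fraction
        if u < cumulative:
            return name
    return name  # guards against floating-point round-off at the upper edge


fractions = split_fractions(SPLIT_SIZES)
for name, share in fractions.items():
    print(f"{name}: {share:.1%}")
```

Splitting by patient rather than by instance matters here because each patient contributes several quarterly instances.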

          Exposures

          A random sample of 2 137 343 residents of Ontario without type 2 diabetes was obtained at study start time. More than 300 features from data sets capturing demographic information, laboratory measurements, drug benefits, health care system interactions, social determinants of health, and ambulatory care and hospitalization records were compiled over 2-year patient medical histories to generate quarterly predictions.
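To make the quarterly setup concrete, the sketch below enumerates the prediction points available when each prediction draws on a 2-year history and looks 5 years ahead within an 11-year study period. Only the 2-year look-back, the 5-year lead time, and the study dates come from the abstract. The quarter indexing, the `Instance` type, and the way the target window is bounded are assumptions made for illustration.

```python
from typing import Iterator, NamedTuple


class Instance(NamedTuple):
    obs_start: int   # first quarter of the look-back window (inclusive)
    predict_at: int  # quarter at which the prediction is issued
    target_end: int  # quarter by which the outcome is checked (exclusive)


def quarterly_instances(n_quarters: int, lookback: int, horizon: int) -> Iterator[Instance]:
    """Yield every prediction point whose look-back and look-ahead
    windows both fit inside the study period."""
    for predict_at in range(lookback, n_quarters - horizon + 1):
        yield Instance(predict_at - lookback, predict_at, predict_at + horizon)


STUDY_QUARTERS = 11 * 4  # January 2006 through December 2016
LOOKBACK = 2 * 4         # 2-year patient history, per the abstract
HORIZON = 5 * 4          # 5-year lead time; the exact target window is assumed

instances = list(quarterly_instances(STUDY_QUARTERS, LOOKBACK, HORIZON))
print(len(instances), instances[0], instances[-1])
```

Under these assumptions a patient observed for the whole period would yield at most 17 instances; patients who enter late, leave early, or develop diabetes would contribute fewer.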

          Main Outcomes and Measures

          Discrimination was assessed using the area under the receiver operating characteristic curve statistic, and calibration was assessed visually using calibration plots. Feature contribution was assessed with Shapley values. Costs were estimated in 2020 US dollars.
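As a minimal sketch of the two measures named above, the code below computes the area under the receiver operating characteristic curve from ranks and builds the table of numbers that a calibration plot would display. It is a plain-Python illustration on toy inputs, not the study's code; the function names and the choice of equal-sized risk groups are assumptions.

```python
def auc(labels, scores):
    """Probability that a random positive case is scored above a random
    negative case, with ties counted as one half (Mann-Whitney form)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative label")
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank for a tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def calibration_table(labels, scores, n_bins=10):
    """Return (mean predicted risk, observed event rate) for each of
    n_bins equal-sized groups ordered by predicted risk."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    rows = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if chunk:
            mean_pred = sum(s for s, _ in chunk) / len(chunk)
            observed = sum(y for _, y in chunk) / len(chunk)
            rows.append((mean_pred, observed))
    return rows


# Toy data, invented for illustration only.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auc(toy_labels, toy_scores))
print(calibration_table(toy_labels, toy_scores, n_bins=2))
```

A well-calibrated model gives rows in which the two values are close; plotting them against the diagonal gives the visual check the abstract describes.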

          Results

          This study trained a gradient boosting decision tree model on data from 1 657 395 patients (12 900 257 instances; 6 666 662 women [51.7%]). The developed model achieved a test area under the curve of 80.26 (range, 80.21-80.29), demonstrated good calibration, and was robust to sex, immigration status, area-level marginalization with regard to material deprivation and race/ethnicity, and low contact with the health care system. The top 5% of patients predicted as high risk by the model represented 26% of the total annual diabetes cost in Ontario.
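The cost-concentration figure above can be read as a simple ratio: rank patients by predicted risk, take the top 5%, and divide their summed cost by the total. The sketch below shows that calculation on invented toy numbers. The function name and inputs are hypothetical, and the study's costing algorithm and province-wide denominator are not reproduced here.

```python
import math


def top_risk_cost_share(risks, costs, top_fraction=0.05):
    """Share of total cost incurred by the patients with the highest
    predicted risk (top_fraction of the cohort, rounded up)."""
    if len(risks) != len(costs) or not risks:
        raise ValueError("risks and costs must be non-empty and the same length")
    total = sum(costs)
    if total <= 0:
        raise ValueError("total cost must be positive")
    k = max(1, math.ceil(top_fraction * len(risks)))
    ranked = sorted(zip(risks, costs), key=lambda rc: rc[0], reverse=True)
    return sum(cost for _, cost in ranked[:k]) / total


# Toy cohort of 20 patients, invented for illustration only.
toy_risks = [i / 20 for i in range(20)]
toy_costs = [3.0] * 19 + [19.0]
print(top_risk_cost_share(toy_risks, toy_costs))
```

In this toy cohort the single highest-risk patient accounts for a quarter of all cost, which is the kind of concentration the reported 26% describes.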

          Conclusions and Relevance

          In this decision analytical model study, a machine learning model approach accurately predicted the incidence of diabetes in the population using routinely collected health administrative data. These results suggest that the model could be used to inform decision-making for population health planning and diabetes prevention.

                Author and article information

                Journal
                JAMA Netw Open
                JAMA Network Open
                American Medical Association
                2574-3805
                25 May 2021
                Volume 4, Issue 5, e2111315
                Affiliations
                [1 ]Layer 6 AI, Toronto, Ontario, Canada
                [2 ]Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
                [3 ]Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada
                [4 ]Temerty Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada
                [5 ]Temerty Centre for Artificial Intelligence Research and Education in Medicine, University of Toronto, Toronto, Ontario, Canada
                [6 ]Vector Institute, Toronto, Ontario, Canada
                [7 ]Institute of Clinical Evaluative Sciences (ICES), Toronto, Ontario, Canada
                [8 ]Institute for Better Health, Trillium Health Partners, Mississauga, Ontario, Canada
                Author notes
                Article Information
                Accepted for Publication: April 1, 2021.
                Published: May 25, 2021. doi:10.1001/jamanetworkopen.2021.11315
                Open Access: This is an open access article distributed under the terms of the CC-BY License. © 2021 Ravaut M et al. JAMA Network Open.
                Corresponding Author: Laura C. Rosella, PhD, Dalla Lana School of Public Health, University of Toronto, 155 College St, Ste 672, Toronto, ON M5T 3M7, Canada (laura.rosella@utoronto.ca).
                Author Contributions: Mr Ravaut had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
                Concept and design: Ravaut, Sadeghi, Leung, Volkovs, Poutanen, Rosella.
                Acquisition, analysis, or interpretation of data: Ravaut, Harish, Sadeghi, Leung, Volkovs, Kornas, Watson, Rosella.
                Drafting of the manuscript: Ravaut, Harish, Sadeghi, Leung, Rosella.
                Critical revision of the manuscript for important intellectual content: All authors.
                Statistical analysis: Ravaut, Sadeghi, Leung, Volkovs, Rosella.
                Obtained funding: Poutanen, Rosella.
                Administrative, technical, or material support: Ravaut, Sadeghi, Kornas, Watson, Poutanen.
                Supervision: Sadeghi, Volkovs, Poutanen, Rosella.
                Conflict of Interest Disclosures: Mr Harish reported receiving grants from the Canadian Institutes of Health Research (scholarship), the University of Toronto and Province of Ontario (scholarship), and the Vector Institute (scholarship) during the conduct of the study. Dr Rosella reported receiving grants from Canada Research Chairs, the Connaught Global Challenge Award, and the New Frontiers in Research Fund during the conduct of the study. No other disclosures were reported.
                Funding/Support: This study was funded by grant FRN 72055702 from the Connaught Global Challenge Award 2018/2019 (Dr Rosella). Dr Rosella was supported by grant FRN 72060091 from the Canada Research Chair in Population Health Analytics. Mr Harish was supported by the Ontario Graduate Scholarship and Canadian Institutes of Health Research Banting and Best Canada Graduate Scholarship-Master’s awards and the Vector Institute Postgraduate Affiliate program award.
                Role of the Funder/Sponsor: The funders had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.
                Disclaimer: The analyses, conclusions, opinions, and statements expressed herein are solely those of the authors and do not reflect those of the funding or data sources; no endorsement is intended or should be inferred.
                Additional Contributions: We thank IMS Brogan Inc for use of their Drug Information Database. For modeling infrastructure, we thank the ICES information technology staff.
                Additional Information: This study was supported by ICES, which is funded by an annual grant from the Ontario Ministry of Health and Long-Term Care. Parts of this material are based on data and information compiled and provided by the Ministry of Health and Long-Term Care; the Canadian Institute for Health Information; Immigration, Refugees and Citizenship Canada; and the Office of the Registrar General. The data set for this study is held securely in coded form at ICES. Although data-sharing agreements prohibit ICES from making the data set publicly available, access may be granted to those who meet prespecified criteria for confidential access, available at https://www.ices.on.ca/DAS. The full data set creation plan is available from the authors upon request. The data for this study were prepared with custom code from ICES using SAS Enterprise software, version 6.1 (SAS Institute Inc). These data were later analyzed with custom code from Layer 6 AI in the Java 8 and Python 3.6 programming languages. The analytic code is available from the authors upon request, with the understanding that the computer programs may rely on coding templates or macros that are unique to ICES and these data and thus may require modification.
                Article
                zoi210330
                10.1001/jamanetworkopen.2021.11315
                8150694
                34032855
                History
                : 20 November 2020
                : 1 April 2021
                Categories
                Research
                Original Investigation
                Online Only
                Public Health
