      Calibration: the Achilles heel of predictive analytics

      On behalf of Topic Group ‘Evaluating diagnostic tests and prediction models’ of the STRATOS initiative [6]
      BMC Medicine
      BioMed Central
      Calibration, Risk prediction models, Predictive analytics, Overfitting, Heterogeneity, Model performance




          Background

          The assessment of calibration performance of risk prediction models based on regression or more flexible machine learning algorithms receives little attention.

          Main text

          Herein, we argue that this needs to change immediately because poorly calibrated algorithms can be misleading and potentially harmful for clinical decision-making. We summarize how to avoid poor calibration at algorithm development and how to assess calibration at algorithm validation, emphasizing balance between model complexity and the available sample size. At external validation, calibration curves require sufficiently large samples. Algorithm updating should be considered for appropriate support of clinical practice.
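
          As an illustration of assessing calibration at validation, the following minimal sketch computes two commonly used summary measures, the observed/expected ratio and the calibration slope, on simulated data. The abstract does not prescribe these statistics or any code; the function names, the simulated risk distribution, and the deliberately overconfident model (predicted log-odds doubled) are assumptions made purely for this example.

```python
import math
import random


def expit(z):
    """Numerically stable inverse logit."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logit(p):
    return math.log(p / (1.0 - p))


def observed_expected_ratio(y, p):
    """Observed events divided by the sum of predicted risks (1.0 is ideal)."""
    return sum(y) / sum(p)


def calibration_intercept_slope(y, p, iterations=25):
    """Fit y ~ a + b * logit(p) by Newton-Raphson; b is the calibration slope."""
    x = [logit(pi) for pi in p]
    a, b = 0.0, 0.0
    for _ in range(iterations):
        # Gradient (g) and negative Hessian (h) of the logistic log-likelihood
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            mu = expit(a + b * xi)
            w = mu * (1.0 - mu)
            g0 += yi - mu
            g1 += (yi - mu) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton update: solve the 2x2 system h * delta = g
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def demo(seed=0, n=10000):
    """Simulate a validation set (hypothetical data) and summarize calibration."""
    rng = random.Random(seed)
    true_lp = [rng.gauss(-1.0, 1.0) for _ in range(n)]  # assumed true log-odds
    y = [1 if rng.random() < expit(lp) else 0 for lp in true_lp]
    models = {
        "well_calibrated": [expit(lp) for lp in true_lp],
        # Predictions that are too extreme, as an overfitted model might give
        "overconfident": [expit(2.0 * lp) for lp in true_lp],
    }
    results = {}
    for name, p in models.items():
        intercept, slope = calibration_intercept_slope(y, p)
        results[name] = {
            "oe": observed_expected_ratio(y, p),
            "intercept": intercept,
            "slope": slope,
        }
    return results


if __name__ == "__main__":
    for name, stats in demo().items():
        print(name, {k: round(v, 3) for k, v in stats.items()})
```

          In this simulation the well-calibrated predictions should give a slope close to 1, while the overconfident predictions should give a slope clearly below 1, which is the usual signature of risk estimates that are too extreme. Both estimates become unstable in small samples, consistent with the point that calibration assessment at external validation requires sufficiently large samples.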


          Conclusion

          Efforts are required to avoid poor calibration when developing prediction models, to evaluate calibration when validating models, and to update models when indicated. The ultimate aim is to optimize the utility of predictive analytics for shared decision-making and patient counseling.


                Author and article information

                BMC Med
                BMC Medicine
                BioMed Central (London)
                16 December 2019
                Volume: 17
                [1] ISNI 0000 0001 0668 7884, GRID grid.5596.f, Department of Development and Regeneration, KU Leuven, Herestraat 49 box 805, 3000 Leuven, Belgium
                [2] ISNI 0000 0000 8945 2978, GRID grid.10419.3d, Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, Netherlands
                [3] ISNI 0000 0004 1936 7291, GRID grid.7107.1, Medical Statistics Team, Institute of Applied Health Sciences, School of Medicine, Medical Sciences and Nutrition, University of Aberdeen, Aberdeen, UK
                [4] ISNI 0000 0000 8945 2978, GRID grid.10419.3d, Department of Clinical Epidemiology, Leiden University Medical Center, Leiden, Netherlands
                [5] ISNI 0000 0001 0481 6099, GRID grid.5012.6, Department of Epidemiology, CAPHRI Care and Public Health Research Institute, Maastricht University, Maastricht, Netherlands
                [6] http://www.stratos-initiative.org
                © The Author(s) 2019

                Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

                Funded by: Fonds Wetenschappelijk Onderzoek (FundRef http://dx.doi.org/10.13039/501100003130)
                Award ID: G0B4716N
                Funded by: Onderzoeksraad, KU Leuven (FundRef http://dx.doi.org/10.13039/501100004497)
                Award ID: C24/15/037