Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study

Abstract

Background

There is interest in using convolutional neural networks (CNNs) to analyze medical imaging to provide computer-aided diagnosis (CAD). Recent work has suggested that image classification CNNs may not generalize to new data as well as previously believed. We assessed how well CNNs generalized across three hospital systems for a simulated pneumonia screening task.

Methods and findings

A cross-sectional design with multiple model training cohorts was used to evaluate model generalizability to external sites using split-sample validation. A total of 158,323 chest radiographs were drawn from three institutions: National Institutes of Health Clinical Center (NIH; 112,120 from 30,805 patients), Mount Sinai Hospital (MSH; 42,396 from 12,904 patients), and Indiana University Network for Patient Care (IU; 3,807 from 3,683 patients). These patient populations had a mean (SD) age of 46.9 years (16.6), 63.2 years (16.5), and 49.6 years (17) with a female percentage of 43.5%, 44.8%, and 57.3%, respectively. We assessed individual models using the area under the receiver operating characteristic curve (AUC) for radiographic findings consistent with pneumonia and compared performance on different test sets with DeLong’s test. The prevalence of pneumonia was high enough at MSH (34.2%) relative to NIH and IU (1.2% and 1.0%) that merely sorting by hospital system achieved an AUC of 0.861 (95% CI 0.855–0.866) on the joint MSH–NIH dataset. Models trained on data from either NIH or MSH had equivalent performance on IU (P values 0.580 and 0.273, respectively) and inferior performance on data from each other relative to an internal test set (i.e., new data from within the hospital system used for training data; P values both <0.001). The highest internal performance was achieved by combining training and test data from MSH and NIH (AUC 0.931, 95% CI 0.927–0.936), but this model demonstrated significantly lower external performance at IU (AUC 0.815, 95% CI 0.745–0.885, P = 0.001). To test the effect of pooling data from sites with disparate pneumonia prevalence, we used stratified subsampling to generate MSH–NIH cohorts that only differed in disease prevalence between training data sites. When both training data sites had the same pneumonia prevalence, the model performed consistently on external IU data (P = 0.88). When a 10-fold difference in pneumonia rate was introduced between sites, internal test performance improved compared to the balanced model (10× MSH risk P < 0.001; 10× NIH P = 0.002), but this outperformance failed to generalize to IU (MSH 10× P < 0.001; NIH 10× P = 0.027). CNNs were able to directly detect hospital system of a radiograph for 99.95% NIH (22,050/22,062) and 99.98% MSH (8,386/8,388) radiographs. The primary limitation of our approach and the available public data is that we cannot fully assess what other factors might be contributing to hospital system–specific biases.
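The statement that sorting by hospital system alone yields a high AUC can be checked with a short calculation. The sketch below is an illustration and not the authors' code. It scores every MSH radiograph above every NIH radiograph and computes the AUC in closed form as P(positive ranked above negative) plus half the probability of a tie. It assumes that the joint test set contains 8,388 MSH and 22,062 NIH radiographs (the counts reported for the hospital-detection experiment) and that the site-level prevalences of 34.2% and 1.2% hold in that test set. The function name site_only_auc is hypothetical.

```python
def site_only_auc(n_a, prev_a, n_b, prev_b):
    """AUC of a rule that ranks every image from site A above every image from site B.

    Assumption: each site's test images follow its stated prevalence exactly.
    """
    pos_a = round(n_a * prev_a)
    neg_a = n_a - pos_a
    pos_b = round(n_b * prev_b)
    neg_b = n_b - pos_b

    # A positive outranks a negative only when the positive is from A and
    # the negative is from B; same-site pairs are ties and count one half.
    wins = pos_a * neg_b
    ties = pos_a * neg_a + pos_b * neg_b
    total_pairs = (pos_a + pos_b) * (neg_a + neg_b)
    return (wins + 0.5 * ties) / total_pairs


# Assumed test-set sizes (taken from the hospital-detection counts) and
# site-level prevalences from the abstract.
auc = site_only_auc(n_a=8388, prev_a=0.342, n_b=22062, prev_b=0.012)
print(f"site-only AUC: {auc:.3f}")
```

Under these assumptions the result is close to the reported 0.861, which suggests that the prevalence gap between the two sites is enough to account for most of that figure.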

Conclusion

Pneumonia-screening CNNs achieved better internal than external performance in 3 out of 5 natural comparisons. When models were trained on pooled data from sites with different pneumonia prevalence, they performed better on new pooled data from these sites but not on external data. CNNs robustly identified hospital system and department within a hospital, which can have large differences in disease burden and may confound predictions.
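The stratified subsampling used to build cohorts that differ only in pneumonia prevalence can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's pipeline: labels are binary (1 for pneumonia), sampling is without replacement, and the function name subsample_to_prevalence, the cohort size of 400, and the toy label lists are all hypothetical.

```python
import random


def subsample_to_prevalence(labels, target_prev, n_total, seed=0):
    """Return indices of a subsample of size n_total whose positive rate is target_prev."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]

    n_pos = round(n_total * target_prev)
    n_neg = n_total - n_pos
    if n_pos > len(pos) or n_neg > len(neg):
        raise ValueError("not enough cases to reach the requested prevalence")

    # Sample each stratum separately, then shuffle so order carries no label signal.
    chosen = rng.sample(pos, n_pos) + rng.sample(neg, n_neg)
    rng.shuffle(chosen)
    return chosen


# Toy labels for two hypothetical sites (not study data).
site_a = [1] * 300 + [0] * 700
site_b = [1] * 100 + [0] * 900

# Balanced cohorts: both sites contribute the same prevalence (5%).
balanced_a = subsample_to_prevalence(site_a, 0.05, 400)
balanced_b = subsample_to_prevalence(site_b, 0.05, 400)

# Imbalanced cohort: site A at 10 times the prevalence of site B (50% vs 5%).
skewed_a = subsample_to_prevalence(site_a, 0.50, 400)
```

Holding cohort size fixed while varying only the positive rate isolates prevalence as the single difference between training sites.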

Abstract

Eric Oermann and colleagues ask whether a DL-based model for pneumonia detection performs well in external validation and consider the effects of hospital system–specific biases.

Author summary

Why was this study done?

Early results in using convolutional neural networks (CNNs) on X-rays to diagnose disease have been promising, but it has not yet been shown that models trained on X-rays from one hospital or one group of hospitals will work equally well at different hospitals.
Before these tools are used for computer-aided diagnosis in real-world clinical settings, we must verify their ability to generalize across a variety of hospital systems.

What did the researchers do and find?

A cross-sectional design was used to train and evaluate pneumonia screening CNNs on 158,323 chest X-rays from the National Institutes of Health Clinical Center (NIH; n = 112,120 from 30,805 patients), Mount Sinai Hospital (n = 42,396 from 12,904 patients), and Indiana University Network for Patient Care (n = 3,807 from 3,683 patients).
In 3 out of 5 natural comparisons, performance on chest X-rays from outside hospitals was significantly lower than on held-out X-rays from the original hospital system.
CNNs were able to detect where a radiograph was acquired (hospital system, hospital department) with extremely high accuracy and calibrate predictions accordingly.
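Because each dataset contains several X-rays per patient (for example, 112,120 images from 30,805 patients at NIH), a held-out internal test set is usually built by splitting on patients, not on images. The summary does not state how the splits were made, so the sketch below is an assumption about one reasonable approach. The function name split_by_patient, the 40% test fraction, and the toy patient identifiers are hypothetical.

```python
import random


def split_by_patient(patient_ids, test_fraction=0.2, seed=0):
    """Split image indices so that no patient appears in both train and test."""
    rng = random.Random(seed)
    patients = sorted(set(patient_ids))
    rng.shuffle(patients)

    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])

    train_idx = [i for i, p in enumerate(patient_ids) if p not in test_patients]
    test_idx = [i for i, p in enumerate(patient_ids) if p in test_patients]
    return train_idx, test_idx


# Toy example: one entry per image, several images per patient.
image_patients = ["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p5"]
train_idx, test_idx = split_by_patient(image_patients, test_fraction=0.4)
```

An external test set, by contrast, needs no such split: every image from the outside hospital system is unseen by construction.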

What do these findings mean?

The performance of CNNs in diagnosing diseases on X-rays may reflect not only their ability to identify disease-specific imaging findings on X-rays but also their ability to exploit confounding information.
Estimates of CNN performance based on test data from hospital systems used for model training may overstate their likely real-world performance.
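The comparison between internal and external performance can be illustrated with a rank-based AUC and a confidence interval. The study compared AUCs with DeLong's test; the sketch below swaps in a simpler percentile bootstrap, so it is an approximation of the idea and not the authors' method. The scores are synthetic toy values drawn from normal distributions, and the function names auc and bootstrap_ci are hypothetical.

```python
import random


def auc(scores, labels):
    """Mann-Whitney estimate of AUC; ties between a positive and a negative count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(scores, labels, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUC (a stand-in for DeLong's method)."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # a resample needs both classes for AUC to be defined
            stats.append(auc([scores[i] for i in idx], ys))
    stats.sort()
    lo = stats[int(n_boot * alpha / 2)]
    hi = stats[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi


# Synthetic toy scores: positives are better separated on the "internal" set.
rng = random.Random(42)
labels = [1] * 30 + [0] * 90
internal = [rng.gauss(1.5 if y else 0.0, 1.0) for y in labels]
external = [rng.gauss(0.7 if y else 0.0, 1.0) for y in labels]

for name, scores in [("internal", internal), ("external", external)]:
    lo, hi = bootstrap_ci(scores, labels)
    print(f"{name}: AUC {auc(scores, labels):.3f} (95% CI {lo:.3f} to {hi:.3f})")
```

Reporting both estimates side by side makes it visible when an internal test set overstates how the model would perform at a new hospital.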

Author and article information

Contributors

John R. Zech:

ORCID: http://orcid.org/0000-0003-1317-8951

Roles: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

Marcus A. Badgeley:

ORCID: http://orcid.org/0000-0001-8064-9050

Roles: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Visualization, Writing – original draft, Writing – review & editing

Manway Liu:

ORCID: http://orcid.org/0000-0002-0991-3741

Roles: Formal analysis, Writing – review & editing

Anthony B. Costa:

ORCID: http://orcid.org/0000-0002-2202-6450

Roles: Data curation, Project administration, Resources, Writing – review & editing

Joseph J. Titano:

Roles: Data curation, Project administration, Writing – review & editing

Eric Karl Oermann:

ORCID: http://orcid.org/0000-0002-1876-5963

Roles: Conceptualization, Data curation, Methodology, Supervision, Writing – original draft, Writing – review & editing

Aziz Sheikh: Role: Academic Editor

Journal

Journal ID (nlm-ta): PLoS Med

Journal ID (iso-abbrev): PLoS Med

Journal ID (publisher-id): plos

Journal ID (pmc): plosmed

Title: PLoS Medicine

Publisher: Public Library of Science (San Francisco, CA, USA)

ISSN (Print): 1549-1277

ISSN (Electronic): 1549-1676

Publication date (Electronic): 6 November 2018

Publication date Collection: November 2018

Volume: 15

Issue: 11

Electronic Location Identifier: e1002683

Affiliations

[1] Department of Medicine, California Pacific Medical Center, San Francisco, California, United States of America

[2] Verily Life Sciences, South San Francisco, California, United States of America

[3] Department of Neurological Surgery, Icahn School of Medicine, New York, New York, United States of America

[4] Department of Radiology, Icahn School of Medicine, New York, New York, United States of America

Edinburgh University, UNITED KINGDOM

Author notes

I have read the journal's policy and the authors of this manuscript have the following competing interests: MAB and ML are currently employees at Verily Life Sciences, which played no role in the research and has no commercial interest in it. EKO and ABC receive funding from Intel for unrelated work.

* E-mail: eric.oermann@mountsinai.org

Article

Publisher ID: PMEDICINE-D-18-01277

DOI: 10.1371/journal.pmed.1002683

PMC ID: 6219764

PubMed ID: 30399157

SO-VID: 9280d009-5fdc-4e8b-a74d-918125728e30

License:

This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

History

Date received: 10 April 2018

Date accepted: 28 September 2018

Page count

Figures: 3, Tables: 2, Pages: 17

Funding

Funded by: Icahn School of Medicine at Mount Sinai (funder ID: http://dx.doi.org/10.13039/100007277)

Award recipient: Joseph J. Titano

The Department of Radiology at the Icahn School of Medicine at Mount Sinai (http://icahn.mssm.edu/about/departments/radiology) supported this project financially via internal department funding (author JJT). No other authors received specific funding for this work. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Custom metadata

Data availability: The code pipeline used to train and test the model across multiple institutions is available at https://github.com/jrzech/cxr-generalize. The NIH ChestX-ray14 dataset was curated and made publicly available by the National Institutes of Health (NIH) Clinical Center (https://nihcc.app.box.com/v/ChestXray-NIHCC). The Open-I dataset of chest radiographs from the Indiana University Hospital network was curated and made publicly available by the National Library of Medicine, NIH (https://openi.nlm.nih.gov/faq.php). Retrospective data used in this study from Mount Sinai Health System cannot be released under the terms of our Institutional Review Board approval to protect patient confidentiality. Researchers interested in accessing Mount Sinai data through the Imaging Research Warehouse Initiative may contact Zahi Fayad, PhD at zahi.fayad@mssm.edu.

ScienceOpen disciplines: Medicine
