+1 Recommend
0 collections
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      Design Characteristics of Studies Reporting the Performance of Artificial Intelligence Algorithms for Diagnostic Analysis of Medical Images: Results from Recently Published Papers

      Read this article at

          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.



          To evaluate the design characteristics of studies that evaluated the performance of artificial intelligence (AI) algorithms for the diagnostic analysis of medical images.

          Materials and Methods

          PubMed MEDLINE and Embase databases were searched to identify original research articles published between January 1, 2018 and August 17, 2018 that investigated the performance of AI algorithms that analyze medical images to provide diagnostic decisions. Eligible articles were evaluated to determine 1) whether the study used external validation rather than internal validation, and in case of external validation, whether the data for validation were collected, 2) with diagnostic cohort design instead of diagnostic case-control design, 3) from multiple institutions, and 4) in a prospective manner. These are fundamental methodologic features recommended for clinical validation of AI performance in real-world practice. The studies that fulfilled the above criteria were identified. We classified the publishing journals into medical vs. non-medical journal groups. Then, the results were compared between medical and non-medical journals.


          Of 516 eligible published studies, only 6% (31 studies) performed external validation. None of the 31 studies adopted all three design features: diagnostic cohort design, the inclusion of multiple institutions, and prospective data collection for external validation. No significant difference was found between medical and non-medical journals.


          Nearly all of the studies published in the study period that evaluated the performance of AI algorithms for diagnostic analysis of medical images were designed as proof-of-concept technical feasibility studies and did not have the design features that are recommended for robust validation of the real-world clinical performance of AI algorithms.

          Related collections

          Most cited references 36

          • Record: found
          • Abstract: found
          • Article: found
          Is Open Access

          Deep Learning in Medical Imaging: General Overview

          The artificial neural network (ANN)–a machine learning technique inspired by the human neuronal synapse system–was introduced in the 1950s. However, the ANN was previously limited in its ability to solve actual problems, due to the vanishing gradient and overfitting problems with training of deep architecture, lack of computing power, and primarily the absence of sufficient data to train the computer system. Interest in this concept has lately resurfaced, due to the availability of big data, enhanced computing power with the current graphics processing units, and novel algorithms to train the deep neural network. Recent studies on this technology suggest its potentially to perform better than humans in some visual and auditory recognition tasks, which may portend its applications in medicine and healthcare, especially in medical imaging, in the foreseeable future. This review article offers perspectives on the history, development, and applications of deep learning technology, particularly regarding its applications in medical imaging.
            • Record: found
            • Abstract: found
            • Article: not found

            Case-control and two-gate designs in diagnostic accuracy studies.

            In some diagnostic accuracy studies, the test results of a series of patients with an established diagnosis are compared with those of a control group. Such case-control designs are intuitively appealing, but they have also been criticized for leading to inflated estimates of accuracy. We discuss similarities and differences between diagnostic and etiologic case-control studies, as well as the mechanisms that can lead to variation in estimates of diagnostic accuracy in studies with separate sampling schemes ("gates") for diseased (cases) and nondiseased individuals (controls). Diagnostic accuracy studies are cross-sectional and descriptive in nature. Etiologic case-control studies aim to quantify the effect of potential causal exposures on disease occurrence, which inherently involves a time window between exposure and disease occurrence. Researchers and readers should be aware of spectrum effects in diagnostic case-control studies as a result of the restricted sampling of cases and/or controls, which can lead to changes in estimates of diagnostic accuracy. These spectrum effects may be advantageous in the early investigation of a new diagnostic test, but for an overall evaluation of the clinical performance of a test, case-control studies should closely mimic cross-sectional diagnostic studies. As the accuracy of a test is likely to vary across subgroups of patients, researchers and clinicians might carefully consider the potential for spectrum effects in all designs and analyses, particularly in diagnostic accuracy studies with differential sampling schemes for diseased (cases) and nondiseased individuals (controls).
              • Record: found
              • Abstract: found
              • Article: found
              Is Open Access

              Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study

              Background There is interest in using convolutional neural networks (CNNs) to analyze medical imaging to provide computer-aided diagnosis (CAD). Recent work has suggested that image classification CNNs may not generalize to new data as well as previously believed. We assessed how well CNNs generalized across three hospital systems for a simulated pneumonia screening task. Methods and findings A cross-sectional design with multiple model training cohorts was used to evaluate model generalizability to external sites using split-sample validation. A total of 158,323 chest radiographs were drawn from three institutions: National Institutes of Health Clinical Center (NIH; 112,120 from 30,805 patients), Mount Sinai Hospital (MSH; 42,396 from 12,904 patients), and Indiana University Network for Patient Care (IU; 3,807 from 3,683 patients). These patient populations had an age mean (SD) of 46.9 years (16.6), 63.2 years (16.5), and 49.6 years (17) with a female percentage of 43.5%, 44.8%, and 57.3%, respectively. We assessed individual models using the area under the receiver operating characteristic curve (AUC) for radiographic findings consistent with pneumonia and compared performance on different test sets with DeLong’s test. The prevalence of pneumonia was high enough at MSH (34.2%) relative to NIH and IU (1.2% and 1.0%) that merely sorting by hospital system achieved an AUC of 0.861 (95% CI 0.855–0.866) on the joint MSH–NIH dataset. Models trained on data from either NIH or MSH had equivalent performance on IU (P values 0.580 and 0.273, respectively) and inferior performance on data from each other relative to an internal test set (i.e., new data from within the hospital system used for training data; P values both <0.001). The highest internal performance was achieved by combining training and test data from MSH and NIH (AUC 0.931, 95% CI 0.927–0.936), but this model demonstrated significantly lower external performance at IU (AUC 0.815, 95% CI 0.745–0.885, P = 0.001). To test the effect of pooling data from sites with disparate pneumonia prevalence, we used stratified subsampling to generate MSH–NIH cohorts that only differed in disease prevalence between training data sites. When both training data sites had the same pneumonia prevalence, the model performed consistently on external IU data (P = 0.88). When a 10-fold difference in pneumonia rate was introduced between sites, internal test performance improved compared to the balanced model (10× MSH risk P < 0.001; 10× NIH P = 0.002), but this outperformance failed to generalize to IU (MSH 10× P < 0.001; NIH 10× P = 0.027). CNNs were able to directly detect hospital system of a radiograph for 99.95% NIH (22,050/22,062) and 99.98% MSH (8,386/8,388) radiographs. The primary limitation of our approach and the available public data is that we cannot fully assess what other factors might be contributing to hospital system–specific biases. Conclusion Pneumonia-screening CNNs achieved better internal than external performance in 3 out of 5 natural comparisons. When models were trained on pooled data from sites with different pneumonia prevalence, they performed better on new pooled data from these sites but not on external data. CNNs robustly identified hospital system and department within a hospital, which can have large differences in disease burden and may confound predictions.

                Author and article information

                Korean J Radiol
                Korean J Radiol
                Korean Journal of Radiology
                The Korean Society of Radiology
                March 2019
                19 February 2019
                : 20
                : 3
                : 405-410
                [1 ]Department of Radiology, Taean-gun Health Center and County Hospital, Taean-gun, Korea.
                [2 ]Department of Radiology and Research Institute of Radiology, University of Ulsan College of Medicine, Asan Medical Center, Seoul, Korea.
                Author notes
                Corresponding author: Seong Ho Park, MD, PhD, Department of Radiology and Research Institute of Radiology, University of Ulsan College of Medicine, Asan Medical Center, 88 Olympic-ro 43-gil, Songpa-gu, Seoul 05505, Korea. Tel: (822) 3010-5984, Fax: (822) 476-4719, seongho@ 123456amc.seoul.kr

                *These authors contributed equally to this work.

                Copyright © 2019 The Korean Society of Radiology

                This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License ( http://creativecommons.org/licenses/by-nc/4.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

                Funded by: Korea Health Industry Development Institute, CrossRef https://doi.org/10.13039/501100003710;
                Award ID: HI18C1216
                Artificial Intelligence
                Original Article


                Comment on this article