Pneumothorax can precipitate a life-threatening emergency due to lung collapse and respiratory or circulatory distress. Pneumothorax is typically detected on chest X-ray; however, treatment depends on timely review of radiographs. Since current imaging volumes may result in long worklists of radiographs awaiting review, an automated method of prioritizing X-rays with pneumothorax may reduce time to treatment. Our objective was to create a large human-annotated dataset of chest X-rays containing pneumothorax and to train deep convolutional networks to screen for potentially emergent moderate or large pneumothorax at the time of image acquisition.
In all, 13,292 frontal chest X-rays (3,107 with pneumothorax) were visually annotated by radiologists. This dataset was used to train and evaluate multiple network architectures. Images showing large- or moderate-sized pneumothorax were considered positive, and those with trace or no pneumothorax were considered negative. Images showing small pneumothorax were excluded from training. Using an internal validation set (n = 1,993), we selected the 2 top-performing models; these models were then evaluated on a held-out internal test set based on area under the receiver operating characteristic curve (AUC), sensitivity, specificity, and positive predictive value (PPV). The final internal test was performed initially on a subset with small pneumothorax excluded (as in training; n = 1,701), then on the full test set (n = 1,990), with small pneumothorax included as positive. External evaluation was performed using the National Institutes of Health (NIH) ChestX-ray14 set, a public dataset labeled for chest pathology based on text reports. All images labeled with pneumothorax were considered positive, because the NIH set does not classify pneumothorax by size. In internal testing, our “high sensitivity model” produced a sensitivity of 0.84 (95% CI 0.78–0.90), specificity of 0.90 (95% CI 0.89–0.92), and AUC of 0.94 for the test subset with small pneumothorax excluded. Our “high specificity model” showed sensitivity of 0.80 (95% CI 0.72–0.86), specificity of 0.97 (95% CI 0.96–0.98), and AUC of 0.96 for this set. PPVs were 0.45 (95% CI 0.39–0.51) and 0.71 (95% CI 0.63–0.77), respectively. Internal testing on the full set showed expected decreased performance (sensitivity 0.55, specificity 0.90, and AUC 0.82 for high sensitivity model and sensitivity 0.45, specificity 0.97, and AUC 0.86 for high specificity model). External testing using the NIH dataset showed some further performance decline (sensitivity 0.28–0.49, specificity 0.85–0.97, and AUC 0.75 for both).
Due to labeling differences between internal and external datasets, these findings represent a preliminary step towards external validation.
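The labeling rule and the two internal evaluation regimes described above (small pneumothorax excluded, then small pneumothorax counted as positive) can be illustrated with a minimal Python sketch. The label strings, function names, and toy data are assumptions made for illustration and are not taken from the study's code; the sketch only shows how sensitivity, specificity, and PPV follow from confusion counts under each regime.

```python
# Hypothetical size labels; the study names these categories but not their encoding.
POSITIVE = {"moderate", "large"}
NEGATIVE = {"none", "trace"}


def binarize(size, small_policy="exclude"):
    """Map a size label to 1 (positive), 0 (negative), or None (excluded)."""
    if size in POSITIVE:
        return 1
    if size in NEGATIVE:
        return 0
    if size == "small":
        # "exclude" mirrors the training setup; "positive" mirrors the full test set.
        return 1 if small_policy == "positive" else None
    raise ValueError(f"unknown size label: {size!r}")


def screening_metrics(sizes, flagged, small_policy="exclude"):
    """Compute sensitivity, specificity, and PPV from labels and model flags."""
    tp = fp = tn = fn = 0
    for size, flag in zip(sizes, flagged):
        y = binarize(size, small_policy)
        if y is None:
            continue  # image excluded from this evaluation
        if y == 1 and flag:
            tp += 1
        elif y == 1:
            fn += 1
        elif flag:
            fp += 1
        else:
            tn += 1

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # flagged positives / all positives
        "specificity": ratio(tn, tn + fp),  # unflagged negatives / all negatives
        "ppv": ratio(tp, tp + fp),          # true positives / all flagged images
    }


# Toy data only, not study data.
sizes = ["large", "moderate", "small", "small", "none", "none", "trace", "none"]
flagged = [True, False, False, False, False, True, False, False]

excluded = screening_metrics(sizes, flagged, small_policy="exclude")
included = screening_metrics(sizes, flagged, small_policy="positive")
```

In this toy example, counting the two unflagged small pneumothoraces as positive lowers sensitivity from 0.5 to 0.25 while specificity stays at 0.75, which is the same direction of change reported for the full internal test set.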
We trained automated classifiers to detect moderate and large pneumothorax in frontal chest X-rays at high levels of performance on held-out test data. These models may provide a high specificity screening solution to detect moderate or large pneumothorax on images collected when human review might be delayed, such as overnight. They are not intended for unsupervised diagnosis of all pneumothoraces, as many small pneumothoraces (and some larger ones) are not detected by the algorithm. Implementation studies are warranted to develop appropriate, effective clinician alerts for the potentially critical finding of pneumothorax, and to assess their impact on reducing time to treatment.
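As a rough illustration of how a screening model's output might be used to prioritize a worklist, the following Python sketch moves flagged studies ahead of unflagged ones while preserving arrival order within each group. The class name, the 0.5 threshold, and the study identifiers are hypothetical; the study itself does not describe a deployment mechanism, and any real alerting workflow would need the implementation studies mentioned above.

```python
import heapq
from itertools import count


class Worklist:
    """Toy reading worklist: flagged studies first, then first-in first-out."""

    def __init__(self, threshold=0.5):
        # Hypothetical operating point; a real threshold would be chosen on validation data.
        self.threshold = threshold
        self._heap = []
        self._arrival = count()  # tie-breaker that preserves arrival order

    def add(self, study_id, score):
        """Queue a study with its model score (assumed to lie in [0, 1])."""
        priority = 0 if score >= self.threshold else 1
        heapq.heappush(self._heap, (priority, next(self._arrival), study_id))

    def next_study(self):
        """Return the next study identifier to review."""
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


worklist = Worklist()
for study_id, score in [("A", 0.1), ("B", 0.9), ("C", 0.2), ("D", 0.7)]:
    worklist.add(study_id, score)

review_order = [worklist.next_study() for _ in range(len(worklist))]
```

Here the flagged studies B and D are reviewed before A and C. Using a binary flag rather than sorting by raw score keeps unflagged studies in their original order, so a low score never pushes a study further back than it would have been without the model.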
Pneumothorax (collapse of the lung due to air in the chest) can be a life-threatening emergency.
Delays in identifying and treating serious pneumothorax can result in severe harm to patients, including death.
Pneumothorax is often detected by chest X-ray, but delays in review of these images (particularly at hours of lower staffing, such as overnight) can lead to delay in diagnosis and treatment.
Prioritization of images that are suspected to show a pneumothorax for rapid review may result in earlier treatment of pneumothorax.
We developed computer algorithms that scan chest X-rays and flag images that are suspicious for containing a moderate or large pneumothorax.
These algorithms “learned” to identify moderate- and large-sized pneumothorax by training on a large set of both positive and negative chest X-rays.
We created the training set of images by asking board-certified radiologists to label each image for the presence or absence of pneumothorax, as well as their estimate of pneumothorax size.
After training, we tested the algorithms on a similar collection of labeled X-rays that they had never seen and analyzed how well they detected images showing pneumothorax without any human guidance.
We found that our algorithms were able to detect the majority (80%–84%) of images showing a moderate or large pneumothorax, while correctly categorizing 90% or more of images without pneumothorax as “negative.” When we included small pneumothoraces in our test set, performance declined, as expected, because the algorithms had not been trained on images with small pneumothoraces.
When we tested our algorithms on images acquired outside our hospital, performance declined compared with our internal testing. However, the tests of the external dataset were not exactly comparable to our internal tests: small pneumothoraces could not be excluded from the evaluation because labels in the external dataset did not include size, and labels were assigned by computer interpretation of clinical reports rather than by radiologists reevaluating the images, limiting the accuracy of the labels.
Computer algorithms, given enough high-quality training data, are capable of detecting pneumothorax on a chest X-ray with sufficient accuracy to help prioritize images for rapid review by physicians.
Algorithms like these could potentially be used by radiologists as a tool to increase the speed with which a serious pneumothorax is detected, even at times of lower staffing, when turnaround times are typically longer.
Rapid detection and communication with treating physicians may result in faster treatment of pneumothorax, potentially reducing the harm of a serious medical problem.
The transferability of our models to clinical settings outside the institution where the training images were acquired needs further validation. Although we evaluated the models against an external dataset, differences in the composition, curation, and labeling between the external data and our own make it difficult to interpret these external dataset results.