
      Evaluation of large language models in generating pulmonary nodule follow-up recommendations


          Abstract

          Rationale and objectives

To evaluate the performance of large language models (LLMs) in generating clinically appropriate follow-up recommendations for pulmonary nodules by leveraging radiological report findings and management guidelines.

          Materials and methods

This retrospective study included CT follow-up reports of pulmonary nodules documented by senior radiologists from September 1st, 2023, to April 30th, 2024. An additional sixty reports were collected for prompt engineering, based on few-shot learning and chain-of-thought (CoT) methodology. Radiological findings of pulmonary nodules, along with the final prompt, were input into GPT-4o-mini or ERNIE-4.0-Turbo-8K to generate follow-up recommendations. The AI-generated recommendations were evaluated against radiologist-defined, guideline-based standards through binary classification, assessing nodule risk classification, follow-up intervals, and harmfulness. Performance metrics included sensitivity, specificity, positive/negative predictive values, and F1 score.
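To make the prompting setup concrete, here is a minimal illustrative sketch of how a few-shot, chain-of-thought prompt could be assembled and sent to GPT-4o-mini via the OpenAI chat completions API. The guideline snippet, example reports, and helper names are hypothetical placeholders, not the authors' actual prompt; ERNIE-4.0-Turbo-8K would be called analogously through Baidu's API.

```python
# Hypothetical sketch: few-shot + chain-of-thought prompting for nodule follow-up.
# The guideline text, example findings, and recommendations are placeholders,
# not the prompt used in the study.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

GUIDELINE_SNIPPET = (
    "Solid nodule <6 mm, low-risk patient: no routine follow-up. "
    "Solid nodule 6-8 mm: CT at 6-12 months, then consider CT at 18-24 months."
)

FEW_SHOT_EXAMPLES = [
    {
        "findings": "Solid nodule, 5 mm, right upper lobe; no risk factors.",
        "reasoning": "Solid nodule under 6 mm in a low-risk patient.",
        "recommendation": "No routine follow-up required.",
    },
    {
        "findings": "Solid nodule, 7 mm, left lower lobe.",
        "reasoning": "Solid nodule in the 6-8 mm range.",
        "recommendation": "CT at 6-12 months; consider CT at 18-24 months.",
    },
]

def build_messages(findings: str) -> list[dict]:
    """Assemble a few-shot, chain-of-thought prompt for one radiology report."""
    system = (
        "You are a radiologist. Using the guideline below, first reason step by "
        "step about nodule risk, then give a follow-up recommendation.\n\n"
        f"Guideline: {GUIDELINE_SNIPPET}"
    )
    messages = [{"role": "system", "content": system}]
    for ex in FEW_SHOT_EXAMPLES:  # few-shot demonstrations with explicit reasoning
        messages.append({"role": "user", "content": f"Findings: {ex['findings']}"})
        messages.append({
            "role": "assistant",
            "content": f"Reasoning: {ex['reasoning']}\nRecommendation: {ex['recommendation']}",
        })
    messages.append({"role": "user", "content": f"Findings: {findings}"})
    return messages

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=build_messages("Part-solid nodule, 9 mm, right lower lobe."),
    temperature=0,  # deterministic output for reproducible evaluation
)
print(response.choices[0].message.content)
```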

          Results

On 1009 reports from 996 patients (median age, 50.0 years [IQR, 39.0–60.0 years]; 511 male patients), ERNIE-4.0-Turbo-8K and GPT-4o-mini demonstrated comparable performance in both accuracy of follow-up recommendations (94.6% vs 92.8%, P = 0.07) and harmfulness rates (2.9% vs 3.5%, P = 0.48). In nodule risk classification, ERNIE-4.0-Turbo-8K and GPT-4o-mini performed similarly, with accuracy of 99.8% vs 99.9%, sensitivity of 96.9% vs 100.0%, specificity of 99.9% vs 99.9%, positive predictive value of 96.9% vs 96.9%, negative predictive value of 100.0% vs 99.9%, and F1 score of 96.9% vs 98.4%, respectively.
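For reference, the reported metrics follow the standard confusion-matrix definitions. The short sketch below, with made-up label vectors rather than study data, shows how sensitivity, specificity, PPV, NPV, and F1 score are derived from binary classifications.

```python
# Standard binary-classification metrics as reported in the abstract.
# The label vectors here are illustrative only, not study data.

def binary_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Compute sensitivity, specificity, PPV, NPV, and F1 from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sens = tp / (tp + fn)               # sensitivity (recall)
    spec = tn / (tn + fp)               # specificity
    ppv = tp / (tp + fp)                # positive predictive value (precision)
    npv = tn / (tn + fn)                # negative predictive value
    f1 = 2 * ppv * sens / (ppv + sens)  # harmonic mean of precision and recall
    return {"sensitivity": sens, "specificity": spec,
            "ppv": ppv, "npv": npv, "f1": f1}

# Example: ground-truth risk labels vs model-assigned labels (illustrative only).
truth = [1, 1, 0, 0, 0, 1, 0, 0]
preds = [1, 0, 0, 0, 0, 1, 1, 0]
print(binary_metrics(truth, preds))
```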

          Conclusion

          LLMs show promise in providing guideline-based follow-up recommendations for pulmonary nodules, but require rigorous validation and supervision to mitigate potential clinical risks. This study offers insights into their potential role in automated radiological decision support.

          Highlights

          • LLMs showed high accuracy in pulmonary nodule follow-up recommendations.

          • High accuracy achieved in nodule classification by both models.

• Study emphasizes need for human oversight of LLMs in clinical settings.


                Author and article information

Journal: European Journal of Radiology Open (Eur J Radiol Open), Elsevier
ISSN: 2352-0477
Published online: 30 April 2025 (Issue: June 2025)
Volume: 14, Article number: 100655
                Affiliations
[a] Department of Radiology, The Third Affiliated Hospital of Sun Yat-sen University, Guangzhou, China
[b] Department of Interventional Radiology, The Third Affiliated Hospital of Sun Yat-sen University, Guangzhou, China
[c] Big Data and Artificial Intelligence Center, The Third Affiliated Hospital of Sun Yat-sen University, Guangzhou, China
                Author notes
[*] Correspondence to: Department of Pathology, The Third Affiliated Hospital, Sun Yat-sen University, No. 600 Tianhe Rd, Guangzhou, Guangdong 510630, PR China. lich356@mail.sysu.edu.cn
[**] Correspondence to: Department of Radiology, The Third Affiliated Hospital, Sun Yat-sen University, No. 600 Tianhe Rd, Guangzhou, Guangdong 510630, PR China. qinjie@mail.sysu.edu.cn
                Article
PII: S2352-0477(25)00022-X
DOI: 10.1016/j.ejro.2025.100655
PMCID: PMC12088779
PMID: 40391069
                © 2025 The Authors

                This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

History
Received: 22 January 2025
Revised: 14 April 2025
Accepted: 25 April 2025
                Categories
                Article

Keywords: generative pre-trained transformer, large language models, radiology report, pulmonary nodule, computed tomography
