      Assessing ChatGPT’s capacity for clinical decision support in pediatrics: A comparative study with pediatricians using KIDMAP of Rasch analysis

      research-article

          Background:

          The application of large language models (LLMs) in clinical decision support (CDS) is an area that warrants further investigation. ChatGPT, a prominent LLM developed by OpenAI, has shown promising performance across various domains. However, there is limited research evaluating its use specifically in pediatric clinical decision-making. This study aimed to assess ChatGPT’s potential as a CDS tool in pediatrics by evaluating its performance on 8 common clinical symptom prompts. The study objectives were to answer 2 research questions: (1) ChatGPT’s overall grade, on a scale from A (high) to E (low), compared to a normative sample; and (2) the difference in assessments of ChatGPT between 2 pediatricians.

          Methods:

          We compared ChatGPT’s responses to 8 items related to clinical symptoms commonly encountered by pediatricians. Two pediatricians independently assessed the answers provided by ChatGPT in an open-ended format. The scoring system ranged from 0 to 100 and was then transformed into 5 ordinal categories. We simulated 300 virtual students with normally distributed abilities to provide scores on the items under the Rasch rating scale model, with item difficulties ranging from −2 to 2.5 logits. Two visual presentations (a Wright map and a KIDMAP) were generated to answer the 2 research questions outlined in the study objectives.
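          The simulation step described above can be sketched as follows (a minimal illustration, not the authors' code; the 4 category thresholds defining the 5 ordinal categories are hypothetical assumptions, as is the seed):

```python
import numpy as np

rng = np.random.default_rng(42)

n_students = 300
item_difficulty = np.linspace(-2.0, 2.5, 8)    # 8 items, difficulties spanning -2 to 2.5 logits
thresholds = np.array([-2.0, -1.0, 1.0, 2.0])  # hypothetical thresholds for 5 ordinal categories

# Person abilities drawn from a normal distribution (in logits)
theta = rng.normal(loc=0.0, scale=1.0, size=n_students)

def rsm_probs(theta_n, delta_i, tau):
    """Category probabilities under the Rasch rating scale model:
    P(X=k) proportional to exp(k*(theta - delta) - sum of the first k thresholds)."""
    k = np.arange(len(tau) + 1)                    # categories 0..4
    cum_tau = np.concatenate(([0.0], np.cumsum(tau)))
    logits = k * (theta_n - delta_i) - cum_tau
    expx = np.exp(logits - logits.max())           # subtract max for numerical stability
    return expx / expx.sum()

# Simulate one ordinal response (0-4) for every student-item pair
responses = np.array([
    [rng.choice(5, p=rsm_probs(t, d, thresholds)) for d in item_difficulty]
    for t in theta
])
print(responses.shape)  # (300, 8)
```

          The resulting 300 × 8 response matrix plays the role of the normative sample against which ChatGPT's graded responses are positioned on the Wright map.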

          Results:

          The 2 pediatricians’ assessments indicated that ChatGPT’s overall performance corresponded to a grade of C on the A-to-E scale, with average scores of −0.89 logits (SE = 0.37) and 0.90 logits (SE = 0.41), where a logit (log-odds unit) is the measurement unit of Rasch analysis. The difference in assessment between the 2 pediatricians was significant ( P < .05).
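          Because a logit is a log-odds unit, the two reported person measures can be mapped back to probabilities of success on an item of average (0-logit) difficulty; a quick sketch of the conversion:

```python
import math

def logit_to_prob(logit):
    # logit = log(p / (1 - p)); invert via the logistic function to recover p
    return 1.0 / (1.0 + math.exp(-logit))

print(round(logit_to_prob(-0.89), 3))  # 0.291
print(round(logit_to_prob(0.90), 3))   # 0.711
```

          This makes the gap between the two raters concrete: one measure implies roughly a 29% chance of success on an average item, the other roughly 71%.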

          Conclusion:

          This study demonstrates the feasibility of utilizing ChatGPT as a CDS tool for patients presenting with common pediatric symptoms. The findings suggest that ChatGPT has the potential to enhance clinical workflow and aid in responsible clinical decision-making. Further exploration and refinement of ChatGPT’s capabilities in pediatric care can potentially contribute to improved healthcare outcomes and patient management.

          Related collections

          Most cited references: 40


          A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research.

          Intraclass correlation coefficient (ICC) is a widely used reliability index in test-retest, intrarater, and interrater reliability analyses. This article introduces the basic concept of ICC in the content of reliability analysis.

            Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models

            We evaluated the performance of a large language model called ChatGPT on the United States Medical Licensing Exam (USMLE), which consists of three exams: Step 1, Step 2CK, and Step 3. ChatGPT performed at or near the passing threshold for all three exams without any specialized training or reinforcement. Additionally, ChatGPT demonstrated a high level of concordance and insight in its explanations. These results suggest that large language models may have the potential to assist with medical education, and potentially, clinical decision-making.

              How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment

              Background: Chat Generative Pre-trained Transformer (ChatGPT) is a 175-billion-parameter natural language processing model that can generate conversation-style responses to user input. Objective: This study aimed to evaluate the performance of ChatGPT on questions within the scope of the United States Medical Licensing Examination (USMLE) Step 1 and Step 2 exams, as well as to analyze responses for user interpretability. Methods: We used 2 sets of multiple-choice questions to evaluate ChatGPT’s performance, each with questions pertaining to Step 1 and Step 2. The first set was derived from AMBOSS, a commonly used question bank for medical students, which also provides statistics on question difficulty and on performance relative to its user base. The second set was the National Board of Medical Examiners (NBME) free 120 questions. ChatGPT’s performance was compared to 2 other large language models, GPT-3 and InstructGPT. The text output of each ChatGPT response was evaluated across 3 qualitative metrics: logical justification of the answer selected, presence of information internal to the question, and presence of information external to the question. Results: Across the 4 data sets (AMBOSS-Step1, AMBOSS-Step2, NBME-Free-Step1, and NBME-Free-Step2), ChatGPT achieved accuracies of 44% (44/100), 42% (42/100), 64.4% (56/87), and 57.8% (59/102), respectively. ChatGPT outperformed InstructGPT by 8.15% on average across all data sets, and GPT-3 performed similarly to random chance. The model demonstrated a significant decrease in performance as question difficulty increased (P = .01) within the AMBOSS-Step1 data set. Logical justification for ChatGPT’s answer selection was present in 100% of outputs on the NBME data sets. Information internal to the question was present in 96.8% (183/189) of all questions. The presence of information external to the question was 44.5% and 27% lower for incorrect answers relative to correct answers on the NBME-Free-Step1 (P < .001) and NBME-Free-Step2 (P = .001) data sets, respectively. Conclusions: ChatGPT marks a significant improvement in natural language processing models on the task of medical question answering. By performing above the 60% threshold on the NBME-Free-Step1 data set, the model achieves the equivalent of a passing score for a third-year medical student. Additionally, the authors highlight ChatGPT’s capacity to provide logic and informational context across the majority of answers. Taken together, these findings make a compelling case for the potential application of ChatGPT as an interactive medical education tool to support learning.

                Author and article information

                Contributors
                Journal
                Medicine (Baltimore)
                MD
                Medicine
                Lippincott Williams & Wilkins (Hagerstown, MD )
                0025-7974
                1536-5964
                23 June 2023
                Volume 102, Issue 25: e34068
                Affiliations
                [a ] Department of Internal Medicine, Chi Mei Medical Center, Chiali, Taiwan
                [b ] Department of Medical Research, Chi-Mei Medical Center, Tainan, Taiwan
                [c ] The Education University of Hong Kong, Hong Kong, China
                [d ] Department of Physical Medicine and Rehabilitation, Chi Mei Medical Center, Tainan, Taiwan
                [e ] Department of Physical Medicine and Rehabilitation, Chung San Medical University Hospital, Taichung, Taiwan
                [f ] Department of Pediatrics, Chi Mei Medical Center, Tainan, Taiwan
                [g ] Department of Pediatrics, School of Medicine, College of Medicine, Taipei Medical University, Taipei, Taiwan.
                Author notes
                * Correspondence: Julie Chi Chow, Chi-Mei Medical Center, 901 Chung Hwa Road, Yung Kung Dist., Tainan 710, Taiwan (e-mail: jcchow2@yahoo.com.tw).
                Author information
                https://orcid.org/0000-0002-1132-9341
                https://orcid.org/0000-0003-3150-4917
                Article
                00035
                10.1097/MD.0000000000034068
                10289633
                37352054
                f27067b5-165c-4ef7-a322-603a867bb38f
                Copyright © 2023 the Author(s). Published by Wolters Kluwer Health, Inc.

                This is an open-access article distributed under the terms of the Creative Commons Attribution-NonCommercial License 4.0 (CC BY-NC), which permits downloading, sharing, remixing, transforming, and building upon the work, provided it is properly cited. The work cannot be used commercially without permission from the journal.

                History
                : 23 February 2023
                : 18 May 2023
                : 1 June 2023
                Categories
                6200
                Research Article
                Systematic Review and Meta-Analysis
                Custom metadata
                TRUE
                T

                Keywords: artificial intelligence, ChatGPT, KIDMAP, logit, pediatrics, Rasch analysis, Wright map
