ChatGPT™ is an artificial intelligence chatbot launched by San Francisco–based OpenAI on November 30, 2022, with the ability to hold human-like conversations.1
Although literature commenting on ChatGPT™'s abilities has grown over the past months, individual studies assessing its utility in clinical care, research, and teaching in the field of Gastroenterology (GI) have been scarce, with only 2 reported studies.2,3
Our study assesses ChatGPT™’s ability to answer queries regarding appropriate colonoscopy
intervals for colon cancer screening compared to currently applicable guidelines.
Utilizing the American Gastroenterological Association's (AGA) recommendations for follow-up after colonoscopy and polypectomy,4,5 we developed 12 questions to query ChatGPT™ (Table). The queries were entered into ChatGPT™ by the author (SM), and the responses were documented separately (Appendix 1). Each of the 12 query-response pairs underwent adjudication by 4 senior GI fellows (CD, AP, NF, IU), who graded the responses on a semi-qualitative scale of 5 options ranging from "addresses the query and is factually entirely correct" to "does not address the query and is factually incorrect". A field to comment on the potential usefulness to patients was provided. Adjudicators were provided a copy of the AGA guideline as the ground truth to aid assessment of responses. All 4 adjudicators were blinded to the source of the responses to reduce potential bias and were informed only after the conclusion of the study that the responses had been generated by ChatGPT™. The study did not meet criteria for institutional review board submission given the absence of human subjects.
Three of 4 (75%) adjudicators felt that ChatGPT™'s response to Q1 (What is the risk of developing a colon cancer leading to death after a clear colonoscopy?) addressed the query and was factually correct. One of 4 stated it was inaccurate in reporting colon cancer incidence as a percentage (as opposed to a hazard ratio). Three of 4 felt the answer would be usable by patients.
Only 50% (2/4) of the adjudicators felt that ChatGPT™'s response to Q2 (When should colon screening be repeated in a patient with a quality colonoscopy?) addressed the query and was factually correct, although 100% agreed that the answer would be usable by patients. ChatGPT™ had suggested starting colon cancer screening at age 50, with repeat colonoscopies every 10 years. While it was accurate regarding the time interval for repeat colonoscopy, it was inaccurate regarding the age to initiate screening (45 years for average-risk individuals).
Similarly, when assessing ChatGPT™'s response to Q3 (Repeat colonoscopy for patients who had 1–2 small tubular adenomas <10 mm in size that have been completely resected at a high-quality examination?), 75% (3/4) felt that the response would be usable by patients, and 75% (3/4) agreed that while it did address the query, it contained both correct and incorrect information. ChatGPT™ stated that the interval was to be "5–10 years" (instead of the recommended 7–10 years).
Kappa for interrater reliability was 0.189 for all 12 questions, 0.248 for the first 3 questions, and 0.704 when assessing patient usability. Analysis was performed using RStudio.6
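For illustration, a minimal sketch of how such an interrater agreement statistic could be computed in R is shown below (assuming Fleiss' kappa via the irr package; the letter does not specify the exact statistic or package used, and the ratings matrix here is an illustrative placeholder, not the study's actual adjudication data):

    # Fleiss' kappa for 4 raters across 12 query-response pairs (illustrative sketch).
    library(irr)  # provides kappam.fleiss()

    # Rows = the 12 query-response pairs; columns = the 4 adjudicators.
    # Values encode the grade chosen on the 5-option semi-qualitative scale (1-5).
    # Placeholder ratings only, not the study's real data.
    set.seed(1)
    ratings <- matrix(sample(1:5, 12 * 4, replace = TRUE),
                      nrow = 12, ncol = 4,
                      dimnames = list(paste0("Q", 1:12), c("CD", "AP", "NF", "IU")))

    kappam.fleiss(ratings)      # prints the kappa estimate, z statistic, and p-value
    kappam.fleiss(ratings[1:3, ])  # kappa restricted to the first 3 questions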
A summary of all queries is presented in the Table. Critical observations on all responses are presented in Appendix 1. None of the responses were completely inaccurate, as none were found by all adjudicators to be completely wrong. ChatGPT™ was also able to identify rare genetic syndromes in Q11–12.
ChatGPT™'s introduction has generated widespread interest in the academic community. Its ability to draft entire essays and even pass the United States Medical Licensing Exam has led to debates about the ethics of its use.1,7
One area that continues to generate discussion is the question of its authorship on publications.8,9
Scholarly societies, such as the World Association of Medical Editors, state that chatbots cannot be authors as they do not create new knowledge.10
Its capabilities in GI education and research remain relatively unexplored, with only 2 studies describing early experience.2,3
Lahat et al2 assessed its ability to identify questions related to GI research and concluded that while it was able to frame questions, they were not considered novel. Yeo et al3 assessed its ability to answer questions on the management of liver cirrhosis and hepatocellular carcinoma, where it performed favorably.
The purpose of our study was 2-fold: First, can ChatGPT™ accurately answer queries regarding colonoscopy intervals when held to the standard of currently active guidelines? Second, could it be a tool for patient self-education? Regarding the former, ChatGPT™ responded more accurately to simple, direct questions (Questions 1–3) than to more nuanced queries. Regarding the latter, while no patient data were used in this project, adjudicator assessments suggest it may be a useful tool for giving patients background information to inform discussion with their treating physicians. It is not felt to be useful for self-directed care due to potential imprecision.
The study has several strengths: we assessed the accuracy of ChatGPT™'s responses against a standard-of-care guideline and found that ChatGPT™'s ability to provide accurate responses diminishes with more complex medical queries. Additionally, our findings highlight a potential role for ChatGPT™ as an adjunct tool for patient education on the utility and timing of follow-up colonoscopy, although it should not replace information received from a licensed medical provider.
Regarding its limitations: First, human adjudication is prone to error, and the small number of adjudicators and the verbosity of ChatGPT™'s responses resulted in variability in adjudication, as reflected in the weak kappa statistic. Second, the suitability of ChatGPT™'s responses for patient education was determined by the adjudicators rather than by patients. Third, ChatGPT™'s training data are current only through September 2021, which may have contributed to its inaccuracies.
In conclusion, we assessed ChatGPT™'s ability to answer queries regarding appropriate colonoscopy intervals for colon cancer screening and surveillance. Although it under-delivers in its current iteration, it does appear to be a potential source of background information for patient self-education. As global interest in ChatGPT™ continues to increase and the technology iterates, we expect that future renditions will be able to address nuanced queries with increased precision, serving as a readily available resource for GI education.
Supplementary Material