
      Opportunities and risks of ChatGPT in medicine, science, and academic publishing: a modern Promethean dilemma


          Abstract

The release of ChatGPT, the latest large (175-billion-parameter) language model by San Francisco-based company OpenAI, prompted many to think about the exciting (and troublesome) ways artificial intelligence (AI) might change our lives in the very near future. OpenAI's chatbot reportedly gained more than 1 million users in the first few days after its launch and 100 million in the first 2 months, positioning itself as the fastest-growing consumer application in history (1). The hype surrounding ChatGPT is not unjustified: the model is (still) free, easy to use, and able to converse authentically on many subjects in a way that is almost indistinguishable from human communication. Furthermore, considering that ChatGPT was created by fine-tuning the GPT-3.5 model from early 2022 with supervised and reinforcement learning (2), the quality of chatbot-generated content can only improve with additional training and optimization. As the inevitable implementation of this disruptive technology will have far-reaching consequences for medicine, science, and academic publishing, we need to discuss both the opportunities and the risks of its use.

Can ChatGPT replace physicians?

AI has tremendous potential to revolutionize health care and make it more efficient by improving diagnostics, detecting medical errors, and reducing the burden of paperwork (3,4); however, chances are it will never replace physicians. Algorithms perform relatively well on knowledge-based tests despite the lack of domain-specific training: ChatGPT achieved ~66% and ~72% on Basic Life Support and Advanced Cardiovascular Life Support tests, respectively (5), and performed at or near the passing threshold on the United States Medical Licensing Examination (6,7). However, algorithms are notoriously bad at context and nuance (8), two things critical for safe and effective patient care, which requires applying medical knowledge, concepts, and principles in real-world settings. In their analysis of the future of employment, Frey and Osborne estimated that, while the probability of automating administrative health care jobs is relatively high (eg, 91% for health information technicians), the probability of automating the jobs of physicians and surgeons is 0.42% (9). We might object that some evidence indicates fully autonomous robotic systems could be "just around the corner" (10), but the job of a surgeon goes far beyond performing a surgical procedure. The complexity of the physician's job lies in the ability to deliver fully integrated care, providing not only treatment but also compassion. As medical students, we were taught to always take care of patients, not of their medical records: a clinical skill that computer algorithms are still unable to grasp. Therefore, the tremendous potential of AI in health care lies not in the possibility of replacing physicians, but rather in the capacity to increase physicians' efficacy by redistributing workload and optimizing performance. In the words of Alvin Powell from The Harvard Gazette, "A properly developed and deployed AI, experts say, will be akin to the cavalry riding in to help beleaguered physicians struggling with unrelenting workloads, high administrative burdens, and a tsunami of new clinical data" (11).

There are also some ethical issues to consider regarding conversational AI in medical practice. Training a model requires a tremendous amount of (high-quality) data, and current algorithms are often trained on biased data sets. In fact, the models are not only susceptible to availability, selection, and confirmation bias but also tend to amplify these biases (12). For example, ChatGPT can provide biased outputs and perpetuate sexist stereotypes (13), a challenge that has to be resolved before similar AI can be successfully and safely implemented in clinical practice (14-17). Other ethical issues are related to the legal framework. For example, it remains to be determined who is to blame when an AI physician makes an inevitable mistake.

A chatbot-scientist

ChatGPT has already written essays, scholarly manuscripts, and computer code, summarized scientific literature, and performed statistical analyses (18,19). Furthermore, AI might soon be able to successfully perform more complex assignments, such as designing experiments (20) or conducting peer review (18). In some of these tasks, ChatGPT performed alarmingly well. In a recent experiment, researchers used existing publications to generate 50 research abstracts that passed scrutiny by a plagiarism checker, an AI-output detector, and human reviewers (21). On the one hand, the astounding ability of ChatGPT to write specialized texts suggests that similar tools might soon be able to write complete research manuscripts, which would allow scientists to focus on designing and performing experiments rather than on writing manuscripts (18). This shift might promote quality and equity in research by moving the focus from presentation to content and experimental results. On the other hand, conversational AIs are just language models trained to sound convincing, without the ability to interpret and understand the content. Consequently, ChatGPT-generated manuscripts might be misleading, based on non-credible or completely made-up sources (18). Worse still, the surprising quality of ChatGPT-generated text might deceive reviewers and readers, with the end result being an accumulation of dangerous misinformation. Stack Overflow, a popular forum for computer programming-related discussions, banned ChatGPT-generated text "because the average rate of getting correct answers from ChatGPT is too low, the posting of answers created by ChatGPT is substantially harmful to the site and to users who are asking and looking for correct answers" (22). ChatGPT seems to be equally unreliable when it comes to writing research articles. For example, Blanco-Gonzalez et al assessed the ability of ChatGPT to assist human authors in writing review articles and concluded that "…ChatGPT is not a useful tool for writing reliable scientific texts without strong human intervention. It lacks the knowledge and expertise necessary to accurately and adequately convey complex scientific concepts and information." (23). On top of that, the chatbot seems to have an alarming tendency to make up references in order to sound convincing (18,24,25). In fact, the creators of ChatGPT openly disclosed that ChatGPT "sometimes writes plausible-sounding but incorrect or nonsensical answers," a "challenging issue to fix" (2). A failure to acknowledge the limitations of conversational AI might place additional strain on a publishing system already flooded with meaningless data and low-quality manuscripts.

Apart from the problem of unreliability, there are several additional ethical challenges (18,19,26). A chatbot cannot be held accountable for its work, and there is no legal framework to determine who owns the rights to AI-generated work: the author of the manuscript, the author of the AI, or the (unknown) authors who contributed the training data? Furthermore, since ChatGPT often fails to disclose its sources, who is to blame for plagiarism if the chatbot plagiarizes? Until these ethical dilemmas are resolved, most publishers agree that the use of any kind of AI should be clearly acknowledged and that chatbots should not be listed as authors.

Where do we go from here?

The powerful disruptive technology of conversational AI is here to stay, and we can only expect it to improve with additional training and optimization. Banning or actively ignoring its use makes no sense: it can dramatically improve many aspects of our lives by alleviating the burden of daunting and repetitive tasks. In medicine, AI might dramatically improve efficiency just by alleviating a fraction of the suffocating paperwork (27), and optimized chatbots (eg, Stanford's BioMedLM) (28) might speed up and improve literature search. Nevertheless, we should not be lured in by the overwhelming potential of AI. For AI to realize its full potential in medicine and science, we should not implement it hastily but should advocate its mindful introduction and an open debate about the risks and benefits.
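The closing paragraph mentions domain-optimized models such as Stanford's BioMedLM as potential literature tools. As a rough illustration of what querying such a model involves, the sketch below loads a causal language model with the Hugging Face transformers library and generates text for a literature-style prompt. This is a minimal sketch only: it assumes the model is distributed on the Hugging Face Hub under the identifier stanford-crfm/BioMedLM, and the prompt and sampling parameters are illustrative, not drawn from the editorial or its references.

```python
# Minimal sketch: querying a domain-specific biomedical language model
# (assumed here to be available as "stanford-crfm/BioMedLM" on the
# Hugging Face Hub) with a literature-oriented prompt. Illustrative only;
# as the editorial stresses, the output still requires human verification.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "stanford-crfm/BioMedLM"  # assumed Hub identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

prompt = "Summarize the main clinical applications of large language models:"
inputs = tokenizer(prompt, return_tensors="pt")

# Conservative sampling settings; a real literature-search tool would
# also retrieve and cite sources rather than rely on free-form generation.
output_ids = model.generate(
    **inputs,
    max_new_tokens=120,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Even with a domain-tuned model, this kind of free-form generation inherits the reliability problems discussed above, which is why the editorial argues for mindful introduction rather than hasty adoption.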

          Related collections

          Most cited references (20)


          The future of employment: How susceptible are jobs to computerisation?


            Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models

            We evaluated the performance of a large language model called ChatGPT on the United States Medical Licensing Exam (USMLE), which consists of three exams: Step 1, Step 2CK, and Step 3. ChatGPT performed at or near the passing threshold for all three exams without any specialized training or reinforcement. Additionally, ChatGPT demonstrated a high level of concordance and insight in its explanations. These results suggest that large language models may have the potential to assist with medical education, and potentially, clinical decision-making.

              How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment

               Background: Chat Generative Pre-trained Transformer (ChatGPT) is a 175-billion-parameter natural language processing model that can generate conversation-style responses to user input.

               Objective: This study aimed to evaluate the performance of ChatGPT on questions within the scope of the United States Medical Licensing Examination Step 1 and Step 2 exams, as well as to analyze responses for user interpretability.

               Methods: We used 2 sets of multiple-choice questions to evaluate ChatGPT's performance, each with questions pertaining to Step 1 and Step 2. The first set was derived from AMBOSS, a commonly used question bank for medical students, which also provides statistics on question difficulty and the performance on an exam relative to the user base. The second set was the National Board of Medical Examiners (NBME) free 120 questions. ChatGPT's performance was compared to 2 other large language models, GPT-3 and InstructGPT. The text output of each ChatGPT response was evaluated across 3 qualitative metrics: logical justification of the answer selected, presence of information internal to the question, and presence of information external to the question.

               Results: Of the 4 data sets, AMBOSS-Step1, AMBOSS-Step2, NBME-Free-Step1, and NBME-Free-Step2, ChatGPT achieved accuracies of 44% (44/100), 42% (42/100), 64.4% (56/87), and 57.8% (59/102), respectively. ChatGPT outperformed InstructGPT by 8.15% on average across all data sets, and GPT-3 performed similarly to random chance. The model demonstrated a significant decrease in performance as question difficulty increased (P = .01) within the AMBOSS-Step1 data set. We found that logical justification for ChatGPT's answer selection was present in 100% of outputs of the NBME data sets. Information internal to the question was present in 96.8% (183/189) of all questions. The presence of information external to the question was 44.5% and 27% lower for incorrect answers relative to correct answers on the NBME-Free-Step1 (P < .001) and NBME-Free-Step2 (P = .001) data sets, respectively.

               Conclusions: ChatGPT marks a significant improvement in natural language processing models on the tasks of medical question answering. By performing above the 60% threshold on the NBME-Free-Step1 data set, we show that the model achieves the equivalent of a passing score for a third-year medical student. Additionally, we highlight ChatGPT's capacity to provide logic and informational context across the majority of answers. These facts taken together make a compelling case for the potential applications of ChatGPT as an interactive medical education tool to support learning.
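As a schematic of the accuracy metric this study reports (fractions such as 44/100 and 56/87), the sketch below scores model-selected options against an answer key for each data set. It is an illustrative reconstruction under stated assumptions, not the study's actual code; all question data, answers, and set sizes shown are hypothetical placeholders.

```python
# Schematic scoring of multiple-choice answers against an answer key,
# mirroring the accuracy figures reported above (e.g., 44/100 = 44%).
# All data set contents below are hypothetical placeholders.
from typing import Dict, List


def accuracy(model_answers: List[str], answer_key: List[str]) -> float:
    """Return the fraction of questions where the model's choice matches the key."""
    if len(model_answers) != len(answer_key):
        raise ValueError("answer lists must be the same length")
    correct = sum(m == k for m, k in zip(model_answers, answer_key))
    return correct / len(answer_key)


# Hypothetical mini data sets (the real sets had 87-102 questions each).
datasets: Dict[str, Dict[str, List[str]]] = {
    "AMBOSS-Step1": {"model": ["A", "C", "B", "D"], "key": ["A", "B", "B", "D"]},
    "NBME-Free-Step1": {"model": ["E", "A", "C"], "key": ["E", "A", "B"]},
}

for name, data in datasets.items():
    print(f"{name}: {accuracy(data['model'], data['key']):.1%}")
```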

                Author and article information

                Journal
                Croat Med J (CMJ), Croatian Medical Journal
                Croatian Medical Schools
                ISSN: 0353-9504, 1332-8166
                February 2023; 64(1): 1-3
                Affiliations
                [1 ]Department of Pharmacology, University of Zagreb School of Medicine, Zagreb, Croatia
                [2 ]Croatian Institute for Brain Research, University of Zagreb School of Medicine, Zagreb, Croatia
                Article
                DOI: 10.3325/cmj.2023.64.1
                PMCID: PMC10028563
                PMID: 36864812
                Copyright © 2023 by the Croatian Medical Journal. All rights reserved.

                This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

                Categories
                Editorial

                Medicine
