Aim: To assess the visual accuracy of two large language models (LLMs) in microbial classification.
Materials & methods: GPT-4o and Gemini 1.5 Pro were evaluated in distinguishing Gram-positive from Gram-negative bacteria and classifying them as cocci or bacilli using 80 Gram stain images from a labeled database.
Results: GPT-4o achieved 100% accuracy in simultaneously identifying Gram stain and shape for Clostridium perfringens, Pseudomonas aeruginosa and Staphylococcus aureus. Gemini 1.5 Pro showed more variability for the same bacteria (45, 100 and 95%, respectively). Neither LLM correctly identified both Gram stain and bacterial shape for Neisseria gonorrhoeae. Cumulative accuracy plots indicated that GPT-4o performed as well as or better than Gemini 1.5 Pro in every identification, except for the shape of Neisseria gonorrhoeae.
Conclusion: These results suggest that these LLMs in their unprimed state are not ready to be implemented in clinical practice and highlight the need for more research with larger datasets to improve LLMs' effectiveness in clinical microbiology.
This study looked at how well large language models (LLMs) could identify different types of bacteria using images, without having any specific training in this area beforehand.
We tested two LLMs with image analysis capabilities, GPT-4o and Gemini 1.5 Pro. These models were asked to determine whether bacteria were Gram-positive or Gram-negative and whether they were round (cocci) or rod-shaped (bacilli). We used 80 images of four stained bacteria from a labeled database as a reference for this test.
GPT-4o was more accurate than Gemini 1.5 Pro in identifying both the Gram stain and shape of the bacteria. GPT-4o correctly classified the Gram stain and bacterial shape in every image of Clostridium perfringens, Pseudomonas aeruginosa and Staphylococcus aureus. Gemini 1.5 Pro had mixed results for these bacteria. However, both models struggled with Neisseria gonorrhoeae, failing to correctly identify its Gram stain and shape together in any image.
The study shows that while these LLMs have potential, they are not ready to be implemented in clinical practice. More research and larger datasets are needed to improve their accuracy in clinical microbiology.
Large language models (LLMs) are advanced artificial intelligence models, able to generate human-like text, with sophisticated natural language processing capabilities. They are trained on vast amounts of data and use deep learning techniques to understand complex inputs and produce language.
Recent studies have shown the potential of LLMs in medical image analysis across various fields like pathology and ophthalmology, demonstrating their ability to interpret complex medical visual data.
Clinical decisions on infection management such as initial antibiotic choice often rely on Gram stain results. Thus, it is crucial to ensure these tests are conducted and interpreted accurately.
Two LLMs were used in this study: OpenAI's generative pretrained transformer (GPT), version 4 Omni (GPT-4o) and Google's Gemini version 1.5 Pro.
To the best of our knowledge, this study is the first accuracy analysis of the visual LLMs GPT-4o and Gemini 1.5 Pro in the domain of Gram stain and bacterial shape identification.
A publicly available database of bacterial Gram stains was used. Eighty bacterial samples were divided evenly among four bacteria (20 images each) representing all possible combinations of Gram stain and bacterial shape.
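The joint scoring implied by this design can be sketched as follows. The reference labels reflect standard microbiology for the four species named in this study; the function name `is_jointly_correct` and the data layout are illustrative assumptions, not the authors' code.

```python
# Reference labels for the four species (standard microbiology).
# Together they cover every Gram stain x shape combination.
REFERENCE = {
    "Clostridium perfringens": ("positive", "bacilli"),
    "Pseudomonas aeruginosa": ("negative", "bacilli"),
    "Staphylococcus aureus": ("positive", "cocci"),
    "Neisseria gonorrhoeae": ("negative", "cocci"),
}


def is_jointly_correct(species, predicted_gram, predicted_shape):
    """Return True only if both Gram stain and shape match the reference."""
    gram, shape = REFERENCE[species]
    return predicted_gram == gram and predicted_shape == shape


# The four reference labels are all distinct, i.e. a full 2 x 2 design.
assert len(set(REFERENCE.values())) == 4
```

Under this definition, a response that gets the Gram stain right but the shape wrong (or the reverse) counts as incorrect for the combined measure.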
GPT-4o correctly identified both the Gram stain and bacterial shape simultaneously with higher accuracy than Gemini 1.5 Pro (75 vs. 60%, respectively).
When examining the performance by specific bacteria, GPT-4o achieved 100% accuracy in identifying both the Gram stain and shape correctly for Clostridium perfringens, Pseudomonas aeruginosa and Staphylococcus aureus. Gemini 1.5 Pro, on the other hand, showed more variability in its performance for the same bacteria (45, 100 and 95%, respectively). However, neither LLM correctly identified both the Gram stain and bacterial shape for any Neisseria gonorrhoeae image (0% accuracy).
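As a consistency check, the overall combined accuracies follow directly from the per-species figures, assuming the 80 images are split evenly into 20 per species. The sketch below is illustrative only; the variable and function names are hypothetical and the per-species percentages are those reported above.

```python
IMAGES_PER_SPECIES = 20  # 80 images divided evenly among four species

# Reported combined (Gram stain + shape) accuracy per species, in percent.
PER_SPECIES_ACCURACY = {
    "GPT-4o": {
        "Clostridium perfringens": 100,
        "Pseudomonas aeruginosa": 100,
        "Staphylococcus aureus": 100,
        "Neisseria gonorrhoeae": 0,
    },
    "Gemini 1.5 Pro": {
        "Clostridium perfringens": 45,
        "Pseudomonas aeruginosa": 100,
        "Staphylococcus aureus": 95,
        "Neisseria gonorrhoeae": 0,
    },
}


def overall_accuracy(per_species_pct, n_per_species=IMAGES_PER_SPECIES):
    """Pool per-species percentages into one overall percentage."""
    # Convert each percentage back to a count of correct images.
    correct = sum(round(pct * n_per_species / 100)
                  for pct in per_species_pct.values())
    total = n_per_species * len(per_species_pct)
    return 100 * correct / total


for model, per_species in PER_SPECIES_ACCURACY.items():
    print(model, overall_accuracy(per_species))
```

GPT-4o has 60 of 80 images correct (75%) and Gemini 1.5 Pro has 9 + 20 + 19 + 0 = 48 of 80 (60%), matching the reported overall values.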
Cumulative accuracy plots indicated that GPT-4o performed as well as or better than Gemini 1.5 Pro in every identification, except for the shape of Neisseria gonorrhoeae.
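This summary does not state how the cumulative accuracy plots were constructed. One common definition, shown here purely as an assumption, is the running fraction of correct responses after each successive image; the function name and the example outcomes are hypothetical.

```python
def cumulative_accuracy(outcomes):
    """Return the running fraction of correct responses after each image.

    `outcomes` is a sequence of booleans in presentation order
    (True = correct identification).
    """
    running = []
    correct = 0
    for i, is_correct in enumerate(outcomes, start=1):
        correct += int(is_correct)
        running.append(correct / i)
    return running


# Made-up outcomes for illustration only, not study data.
example = [True, False, True, True]
print(cumulative_accuracy(example))
```

Plotting one such curve per model and per task (Gram stain, shape) would let the two models be compared image by image rather than only by their final accuracy.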
The results from this study provide insight into both the potential and the limitations of these LLMs in microbial classification tasks performed without prior domain-specific training.
The results suggest that these LLMs in their unprimed state are not ready to be implemented in clinical practice.
The results underscore the need for further research with larger and more diverse datasets, as well as offline clinical samples, to better understand and enhance the capabilities of these LLMs in clinical microbiology.