Performance of artificial intelligence in bariatric surgery: comparative analysis of ChatGPT-4, Bing, and Bard in the American Society for Metabolic and Bariatric Surgery textbook of bariatric surgery questions

Yung Lee, M.D., M.P.H., Léa Tessier, M.D., Karanbir Brar, M.D., Sarah Malone, B.H.Sc., David Jin, B.H.Sc., Tyler McKechnie, M.D., M.Sc., James J. Jung, M.D., Ph.D., Matthew Kroh, M.D., Jerry T. Dang, M.D., Ph.D., ASMBS Artificial Intelligence and Digital Surgery Taskforce

Abstract

Background

The American Society for Metabolic and Bariatric Surgery (ASMBS) textbook serves as a comprehensive resource for bariatric surgery, covering recent advancements and clinical questions. Testing artificial intelligence (AI) engines against this authoritative source ensures they are evaluated on accurate, up-to-date information and provides insight into their potential implications for surgical education and training.

Objectives

To determine the quality of responses and to compare the ability of different large language models (LLMs) to answer textbook questions relating to bariatric surgery.

Setting

Remote.

Methods

Prompts entered into the LLMs were multiple-choice questions found in “The ASMBS Textbook of Bariatric Surgery, Second Edition.” The prompts were queried in 3 LLMs: OpenAI’s ChatGPT-4, Microsoft’s Bing, and Google’s Bard. The generated responses were assessed on overall accuracy, the number of correct answers by subject matter, and the number of correct answers by question type. Statistical analysis was performed to compare the number of correct responses per LLM per category.
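The abstract does not specify the querying or analysis pipeline; the following is a minimal Python sketch of the workflow it describes, assuming the OpenAI SDK for the ChatGPT-4 arm and a chi-square test for comparing overall accuracy. The model identifier, question format, and helper function are illustrative assumptions; the correct/incorrect tallies are back-calculated from the accuracies reported in the Results.

```python
# Minimal sketch of the querying-and-scoring workflow described above.
# Assumptions (not specified in the abstract): the OpenAI Python SDK is used
# for ChatGPT-4 (Bing and Bard were presumably queried via their interfaces),
# and a chi-square test compares the models' overall accuracy.
from openai import OpenAI
from scipy.stats import chi2_contingency

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_multiple_choice(question: str, options: dict[str, str]) -> str:
    """Send one textbook-style multiple-choice question; return the answer letter."""
    choices = "\n".join(f"{letter}. {text}" for letter, text in options.items())
    prompt = f"{question}\n{choices}\nAnswer with the single letter of the best option."
    response = client.chat.completions.create(
        model="gpt-4",  # assumed model identifier
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()[0]


# Correct/incorrect tallies out of 200 questions, back-calculated from the
# reported accuracies (ChatGPT-4: 83.0%, Bard: 76.0%, Bing: 65.0%).
counts = {
    "ChatGPT-4": (166, 34),
    "Bard": (152, 48),
    "Bing": (130, 70),
}
table = [list(pair) for pair in counts.values()]
chi2, p_value, dof, _ = chi2_contingency(table)
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.4f}")
```

The same contingency-table comparison extends to the subgroup analyses by question category and question type, with one table per subgroup.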

Results

Two hundred questions were used to query the AI models. There was a significant overall difference in accuracy: 83.0% for ChatGPT-4, followed by Bard (76.0%) and Bing (65.0%). Subgroup analysis revealed a significant difference in performance across question categories, with ChatGPT-4 demonstrating the highest proportion of correct answers in questions related to treatment and surgical procedures (83.1%) and complications (91.7%). There was also a significant difference in performance across question types, with ChatGPT-4 showing superior performance on inclusionary questions. Bard and Bing were unable to answer certain questions, whereas ChatGPT-4 left no questions unanswered.

Conclusions

LLMs, particularly ChatGPT-4, demonstrated promising accuracy when answering clinical questions related to bariatric surgery. Continued AI advancement and research are required to elucidate the potential applications of LLMs in training and education.