In Reference to The Comparative Diagnostic Capability of Large Language Models in Otolaryngology
Boscolo-Rizzo, Paolo (penultimate author)
2025-01-01
Abstract
The study by Warrier et al., "The Comparative Diagnostic Capability of Large Language Models in Otolaryngology," addresses the growing integration of artificial intelligence (AI) in clinical practice. Using 100 clinical vignettes, the authors evaluated ChatGPT-3.5, Google Bard, and Bing-GPT4, and found that ChatGPT-3.5 achieved a 95.7% accuracy rate, outperforming its counterparts. This result underscores the diagnostic potential of large language models (LLMs) in otolaryngology and complements recent studies highlighting ChatGPT-4's reliability in analyzing laryngeal images. However, performance variability among LLMs and the evolving nature of AI necessitate careful implementation and oversight. The study focuses primarily on diagnostic accuracy, leaving aside the quality of clinical reasoning and the potential for AI to augment rather than replace human expertise. Future research should incorporate measures of the relevance and quality of AI-generated explanations, as explored by Zalzal et al., and adopt standardized tools such as the Artificial Intelligence Performance Instrument (AIPI) to improve comparability across studies. While the findings are promising, studies like this one are critical for guiding responsible AI integration and identifying areas for improvement in medical applications. Warrier et al. provide valuable insights into the capabilities and limitations of LLMs in otolaryngology, contributing to the ongoing discourse on AI's role in clinical decision-making.
| File | Access | Type | License | Size | Format |
|---|---|---|---|---|---|
| The Laryngoscope - 2024 - Maniaci - In Reference to The Comparative Diagnostic Capability of Large Language Models in.pdf | Closed access | Published version | Publisher copyright | 82.42 kB | Adobe PDF |