Evaluating ChatGPT in Medical Contexts: The Imperative to Guard Against Hallucinations and Partial Accuracies

Giuffrè, Mauro
2024-01-01

Dear Editor: We have read with great interest the recent article titled “Accuracy, Reliability, and Comprehensiveness of ChatGPT-Generated Medical Responses for Patients With Nonalcoholic Fatty Liver Disease,”1 evaluating ChatGPT’s performance in addressing questions related to nonalcoholic fatty liver disease or, as more recently coined, metabolic dysfunction–associated steatotic liver disease (MASLD). Although the study is timely, given the increased capabilities of large language models (LLMs) for health-related clinical decision support,2 we have substantial concerns regarding the measures used to define accuracy and the absence of any quantification of the variability of ChatGPT’s responses.

When considering the capabilities of LLMs such as ChatGPT, the issue of hallucinations poses an unknown that requires advanced methodologies beyond expert-driven Likert scales3 to quantify the degree to which hallucinations affect patient safety. Hallucinations in this context refer to the production of plausible-sounding but unverified or incorrect information.4 In a clinical context, in which decisions can have profound implications for patients’ well-being, the consequences of a hallucination could range from slightly misleading to outright disastrous. A patient who receives erroneous advice or information may suffer adverse health consequences, especially if they act on that information without consulting a health care professional.

The current article relies on clinical expert review, but that review uses a Likert scale that permits the categorization of answers as “nearly all correct.”1 In a medical context, an answer that is almost correct can still be very wrong. A rating of “nearly all correct” might look impressive statistically, but the real-world implications of that “nearly” can be dire. Consider, for example, herbal remedies in the management of MASLD: a ChatGPT response that is mostly accurate but includes potentially harmful elements poses a risk to patients. Because many herbal remedies interact with conventional medications,5 can exacerbate existing conditions, or may be toxic at certain doses, even a partially incorrect or incomplete response can become detrimental if the patient acts only on the erroneous advice and discards the rest of the recommendation. The granularity of a Likert scale may not be well suited to capturing the nuances required in a medical evaluation, in which binary outcomes (correct/incorrect) are more apt. By allowing a middle ground of “partially correct,” we might inadvertently be accepting answers that carry significant risks. New methodologies that quantify the degree of hallucination and estimate its effect on patient safety could mitigate the potential harm without requiring laborious and expensive manual expert review.

The authors recognize the variability of responses from ChatGPT but do not present evidence quantifying the degree of variation in those responses. We agree that responses generated by ChatGPT and other LLMs may vary as a result of training data, context, and linguistic nuance, and even over time as new data are used to fine-tune the models; however, in the high-stakes and time-constrained medical context, we should expect a defined limit on the variation these systems are allowed.
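To make this call for quantification concrete, the following Python sketch illustrates one way the variability of responses to a single question could be summarized: sample the same prompt several times and report how much the answers agree. It is a minimal illustration only; the ask_model placeholder, the token-level Jaccard similarity, and the sample size are our assumptions for exposition rather than the methods of the original study, and a real evaluation would use a semantic similarity measure and a prespecified acceptance threshold.

    # Minimal sketch: quantify response variability by asking the same question
    # repeatedly and summarizing pairwise agreement between the answers.
    # "ask_model" is a hypothetical stand-in for whatever LLM call is used;
    # the token-level Jaccard similarity is illustrative, not a recommended metric.
    from itertools import combinations
    from typing import Callable, Dict


    def jaccard_similarity(a: str, b: str) -> float:
        """Token-overlap similarity between two answers (0 = disjoint, 1 = identical)."""
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / len(ta | tb) if (ta | tb) else 1.0


    def variability_report(ask_model: Callable[[str], str], question: str,
                           n_samples: int = 10) -> Dict[str, float]:
        """Sample the same question n_samples times and summarize pairwise similarity."""
        answers = [ask_model(question) for _ in range(n_samples)]
        scores = [jaccard_similarity(a, b) for a, b in combinations(answers, 2)]
        return {
            "mean_similarity": sum(scores) / len(scores),
            "min_similarity": min(scores),  # the most divergent pair of answers
        }


    if __name__ == "__main__":
        # Canned responses stand in for a real model call in this demonstration.
        canned = iter([
            "Weight loss and exercise are first-line management.",
            "First-line management is lifestyle change: weight loss and exercise.",
            "Weight loss and exercise are first-line management.",
        ] * 2)
        print(variability_report(lambda q: next(canned), "How is MASLD managed?", n_samples=6))

Even a summary this simple would make the degree of variation explicit, rather than leaving it as an acknowledged but unmeasured limitation.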
Although fluctuating responses may be tolerable for other tasks, in the medical context the probability of providing accurate vs inaccurate information must be quantified and constrained. This is particularly important as patients, and even health care professionals, increasingly use LLMs and may rely on them as a primary information source without cross-referencing trusted medical sources. When dealing with a subject as important as personal health, consistent information is crucial. Safety testing that aims to limit variability also should include stress testing with subtle changes in query phrasing, prompt architecture, or in-context learning to evaluate which changes may inadvertently mislead users.

In conclusion, we acknowledge that the study provides valuable insights into the potential utility of LLMs such as ChatGPT in medical contexts, yet it also underscores the importance of understanding the risks of these systems and the role of rigorous interdisciplinary research in appropriately defining the boundaries of accuracy and tolerable variability. The potential for hallucinations, the variability of responses, and the limitations of the evaluation metrics can deeply affect the reliability of such models. Although we may not be able to provide unconditional guarantees regarding the veracity of LLM responses, new methodologies must be developed to evaluate the degree of inaccuracy, and we should define the limit of tolerable variability. As the medical community continues to engage with the AI community, we have a fiduciary duty to protect patients from misinformation and to ensure that we abide by the ethical principle of our profession: primum non nocere.