
Scrutinizing ChatGPT Applications in Gastroenterology: A Call for Methodological Rigor to Define Accuracy and Preserve Privacy

Giuffrè, Mauro
2024-01-01

Dear Editor: We have read with great interest the recent article titled “Accuracy of ChatGPT in Common Gastrointestinal Diseases: Impact for Patients and Providers”1 evaluating ChatGPT’s performance in addressing questions related to 3 main domains (irritable bowel syndrome, inflammatory bowel disease, and colonoscopy and colorectal cancer) from patient and physician perspectives. Although the study provides timely insights into how large language models (LLMs) might support clinical settings, we would like to critically evaluate the methodology used to measure accuracy. The study uses 2 classification methods to assess the accuracy of ChatGPT's responses: a detailed, granular approach that labels responses as completely accurate, completely inaccurate, partially inaccurate, or accurate with missing information; and a simpler method that categorizes responses as either accurate or inaccurate. However, answers categorized as “accurate” may contain partially inaccurate information, as reflected in the reported overall accuracy of 75%–80%, which drops to 50%–55% when accuracy is evaluated with the granular approach.1 We believe that a definition requiring completely accurate answers best reflects the needs of clinical practice, where even minor errors can endanger patients' well-being.2

But what really defines accuracy? Information can be divided into 2 macrocategories: world knowledge and domain knowledge. The data that LLMs are trained on are often unstructured and contain conflicting information, and these data predominantly shape world knowledge. This contrasts with domain-specific knowledge, such as medical guidelines, which is typically systematic, well organized, and curated by experts. Training on inconsistent and sometimes inaccurate world-knowledge data can lead LLMs to generate responses containing fact-conflicting hallucinations, in which the model asserts information that is factually incorrect, reflecting inaccuracies present in its training dataset.3 Grounding the definition of accuracy in domain knowledge is therefore crucial. If the information provided in medical guidelines can answer a given question, that corpus of text should be used to evaluate accuracy. If a question cannot be answered from the guidelines, it should instead be answered by clinical experts in the specified domain, and their answers should serve as the gold standard for accuracy. These criteria would reflect the importance of maintaining trust, which is fundamental in clinical practice, as we engage with these emerging technologies.

Another concern relates to patient privacy and the need to disclose how the clinical-scenario questions were generated (eg, derived from generalized sources, such as Google Trends, or based on actual clinical cases).1 Real patient data provided to LLMs may raise concerns about data privacy and the ethical use of sensitive health information, which are crucial given the potential for data misuse on platforms using LLMs.4 These concerns underscore the importance of a secure, local, hospital-based computational environment to protect patient data. In response to these real and pressing risks, partnerships between LLM and electronic health record companies (Epic and Microsoft, Meditech and Google Health, the Oracle-Cerner merger) aim to safeguard patient data processed by LLM products.
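As a purely illustrative sketch of this kind of local safeguard, crude redaction of obvious identifiers before any clinical text leaves the local environment might look as follows. The patterns and the note are hypothetical, and this is not a validated de-identification method; real deployments require dedicated, audited tooling and governance.

```python
import re

# Hypothetical, minimal sketch: crude pattern-based redaction applied before any
# clinical text leaves the local environment. These regexes are illustrative only
# and are NOT a validated de-identification method.
REDACTION_PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "MRN": re.compile(r"\bMRN[:\s]*\d+\b", re.IGNORECASE),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def redact(text: str) -> str:
    """Replace obvious identifiers with placeholder tags before any external call."""
    for tag, pattern in REDACTION_PATTERNS.items():
        text = pattern.sub(f"[{tag}]", text)
    return text

# Hypothetical note, not real patient data.
note = "Pt seen 03/14/2023, MRN: 0012345, call 415-555-0199 with colonoscopy results."
print(redact(note))
# -> Pt seen [DATE], [MRN], call [PHONE] with colonoscopy results.
```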
Because of the current uncertainty regarding how these platforms process and retain information, it is advisable to exercise caution and to refrain from using them with sensitive data that could lead to patient reidentification.

We believe that LLMs, such as GPT-4, will become a feature of patient care and provider practice in the 21st century. Whether LLMs act as a helpful advisor or an unwelcome interlocutor depends on whether they enhance patient-provider trust in health care delivery. To maintain trustworthiness, narrowly defining accuracy for medical questions is the first step toward developing and deploying strategies (eg, prompt engineering, fine-tuning through reinforcement learning from human feedback) that ensure patient safety and usefulness to providers.5 We need health professionals with both clinical and computational expertise to help bridge this gap and steward this transition, ensuring the safe and effective deployment of artificial intelligence technologies in health care.

Although the study provides valuable insights, it also highlights the critical need to deliver precise answers and to maintain the privacy and confidentiality inherent in the doctor-patient relationship. It is crucial for the medical community to carefully integrate and scrutinize these technologies, ensuring that evaluation methods reflect true performance without overestimating capabilities, while preserving patient privacy. We have a duty to ensure that the technologies and evaluation strategies we adopt meet the highest standards of accuracy and safety. Only then can we truly harness the potential of artificial intelligence to benefit patients and the broader medical community. We look forward to further research and dialogue on this critical topic, and we appreciate the efforts of all those working to advance the understanding of LLMs in health care research and practice.
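To make concrete the gap between binary and granular accuracy discussed above, the following minimal sketch uses hypothetical labels, chosen only to mirror the reported magnitudes rather than the study's actual data, and an assumed rollup rule for the binary scheme.

```python
from collections import Counter

# Hypothetical granular labels for 10 answers (not the study's data), chosen to
# illustrate how a binary accurate/inaccurate rollup can overstate performance
# relative to a strict "completely accurate" definition.
GRANULAR_LABELS = [
    "completely accurate", "completely accurate", "completely accurate",
    "completely accurate", "completely accurate",
    "accurate with missing information",
    "partially inaccurate", "partially inaccurate",
    "completely inaccurate", "completely inaccurate",
]

def binary_accuracy(labels):
    # Assumed rollup for illustration: everything short of "completely inaccurate"
    # counts as "accurate" in the binary scheme.
    return sum(lab != "completely inaccurate" for lab in labels) / len(labels)

def strict_accuracy(labels):
    # Strict definition: only fully correct answers receive credit.
    return sum(lab == "completely accurate" for lab in labels) / len(labels)

print(Counter(GRANULAR_LABELS))
print(f"Binary accuracy: {binary_accuracy(GRANULAR_LABELS):.0%}")  # 80%
print(f"Strict accuracy: {strict_accuracy(GRANULAR_LABELS):.0%}")  # 50%
```

Under these assumptions, the binary scheme reports 80% accuracy while the strict definition credits only 50%, mirroring the pattern reported in the article.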