From Guidelines to Real-Time Conversation: Expert-Validated Retrieval-Augmented and Fine-Tuned GPT-4 for Hepatitis C Management / Giuffrè, Mauro; Pugliese, Nicola; Kresevic, Simone; Ajcevic, Milos; Negro, Francesco; Puoti, Massimo; Forns, Xavier; Pawlotsky, Jean-Michel; Shung, Dennis L.; Aghemo, Alessio. - In: LIVER INTERNATIONAL. - ISSN 1478-3223. - ELECTRONIC. - 45:10(2025), p. e70349. [10.1111/liv.70349]

From Guidelines to Real-Time Conversation: Expert-Validated Retrieval-Augmented and Fine-Tuned GPT-4 for Hepatitis C Management

Giuffrè, Mauro (co-first author); Kresevic, Simone
2025-01-01

Abstract

Background and Aims: Advances in artificial intelligence, particularly large language models (LLMs), hold promise for transforming the management of chronic diseases such as hepatitis C virus (HCV) infection. This study evaluates the impact of retrieval-augmented generation (RAG) and supervised fine-tuning (SFT) on both open-ended question answering (accuracy and clarity) and LLM-recommended treatment regimens for clinical scenarios.

Methods: We employed OpenAI's GPT-4 Turbo in four configurations (baseline, RAG-Top1, RAG-Top10 and SFT), using the 2020 EASL HCV guidelines as external knowledge or fine-tuning data. For the question set, guidelines were segmented at the paragraph level and encoded into 3072-dimensional embeddings. Fifteen questions covering general, patient and physician perspectives were scored by four experts on a 10-point accuracy scale and on binary accuracy and clarity ratings. Separately, we created 25 simulated clinical scenarios; a consensus of four hepatologists defined the gold-standard direct-acting antiviral (DAA) regimens. Model performance on these cases was measured by two metrics: ‘partial accuracy’ (≥ one correct DAA without errors) and ‘complete accuracy’ (all correct DAAs without errors).

Results: On open-ended questions, RAG-Top10 outperformed baseline in accuracy (91.7% vs. 36.6%; p < 0.001) and clarity (91.7% vs. 46.6%; p < 0.001). RAG-Top1 achieved 81.7% accuracy and 86.6% clarity (both p < 0.001), while SFT reached 71.7% accuracy and 88.3% clarity (p < 0.001). Similarly, RAG-Top10 achieved the highest performance on the clinical scenarios, prescribing the correct DAA regimen according to expert consensus in 76% of cases (vs. 24% for the baseline model; p < 0.001).

Conclusions: Both RAG-Top10 and SFT markedly enhance LLM performance in guideline-driven HCV management, improving not only response accuracy and clarity but also DAA selection in clinical scenarios. RAG-Top10's broader context retrieval confers the greatest gains, while SFT underscores the value of domain-specific alignment. Rigorous, expert-informed evaluation frameworks are essential for the safe integration of LLMs into clinical practice.
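The difference between the RAG-Top1 and RAG-Top10 configurations described in the abstract comes down to how many guideline paragraphs are retrieved and placed in the prompt. The following is a minimal sketch of top-k retrieval over paragraph embeddings, not the authors' implementation. It assumes cosine similarity as the ranking measure (the abstract does not name one), uses 3-dimensional toy vectors in place of the 3072-dimensional embeddings, and uses placeholder paragraph text; the function names `cosine_similarity`, `retrieve_top_k` and `build_prompt` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed ranking measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_vec, paragraph_vecs, k):
    """Return the indices of the k paragraphs most similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), idx)
        for idx, vec in enumerate(paragraph_vecs)
    ]
    # Highest similarity first; ties broken by original paragraph order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


def build_prompt(question, paragraphs, top_indices):
    """Prepend the retrieved paragraphs to the question as context."""
    context = "\n\n".join(paragraphs[i] for i in top_indices)
    return f"Context:\n{context}\n\nQuestion: {question}"


# Toy data: placeholders only, not real guideline text or real embeddings.
paragraphs = ["placeholder paragraph 0", "placeholder paragraph 1", "placeholder paragraph 2"]
paragraph_vecs = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 0.0, 1.0]]
query_vec = [0.9, 0.1, 0.0]

top1 = retrieve_top_k(query_vec, paragraph_vecs, k=1)  # analogous to RAG-Top1
top2 = retrieve_top_k(query_vec, paragraph_vecs, k=2)  # wider context, as in RAG-Top10
prompt = build_prompt("placeholder question", paragraphs, top2)
```

With a larger k, paragraphs that rank just below the best match still reach the model, which is one plausible reading of the "broader context retrieval" credited for the RAG-Top10 gains.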
Files in this item:
File: Liver International - 2025 - Giuffrè - From Guidelines to Real‐Time Conversation Expert‐Validated Retrieval‐Augmented and (5).pdf
Access: open access
Type: Publisher's version
License: Creative Commons
Size: 1.48 MB
Format: Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11368/3135149
Citations
  • PMC: not available
  • Scopus: 3
  • Web of Science: 2