Expert of Experts Verification and Alignment (EVAL) Framework for Large Language Models Safety in Gastroenterology / Giuffrè, Mauro; You, Kisung; Pang, Ziteng; Kresevic, Simone; Chung, Sunny; Chen, Ryan; Ko, Youngmin; Chan, Colleen; Saarinen, Theo; Ajcevic, Milos; Crocè, Lory S.; Garcia-Tsao, Guadalupe; Gralnek, Ian; Sung, Joseph J. Y.; Barkun, Alan; Laine, Loren; Sekhon, Jasjeet; Stadie, Bradly; Shung, Dennis L. - In: npj Digital Medicine. - ISSN 2398-6352. - Electronic. - 8:1 (2025), 242. [10.1038/s41746-025-01589-z]
Expert of Experts Verification and Alignment (EVAL) Framework for Large Language Models Safety in Gastroenterology
2025-01-01
Abstract
Large language models generate plausible text responses to medical questions, but inaccurate responses pose significant risks in medical decision-making. Grading LLM outputs to determine the best model or answer is time-consuming and impractical in clinical settings; therefore, we introduce EVAL (Expert-of-Experts Verification and Alignment) to streamline this process and enhance LLM safety for upper gastrointestinal bleeding (UGIB). We evaluated OpenAI's GPT-3.5/4/4o/o1-preview, Anthropic's Claude-3-Opus, Meta's LLaMA-2 (7B/13B/70B), and Mistral AI's Mixtral (7B) across 27 configurations, including zero-shot baseline, retrieval-augmented generation, and supervised fine-tuning. EVAL uses similarity-based ranking and a reward model trained on human-graded responses for rejection sampling. Among the employed similarity metrics, Fine-Tuned ColBERT achieved the highest alignment with human performance across three separate datasets (ρ = 0.81-0.91). The reward model replicated human grading in 87.9% of cases across temperature settings and significantly improved accuracy through rejection sampling by 8.36% overall. EVAL offers scalable potential to assess accuracy for high-stakes medical decision-making.
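The reward-model rejection sampling described in the abstract can be sketched as follows. This is a minimal illustration of the general technique, not the authors' implementation: `toy_generate` and `toy_reward` are hypothetical stand-ins for the candidate LLM and the trained reward model, and the candidate count and threshold are arbitrary.

```python
import random


def rejection_sample(prompt, generate, reward, n_candidates=8, threshold=0.5):
    """Draw several candidate answers, score each with the reward model,
    and return the highest-scoring one if it clears the threshold;
    return None when every candidate is rejected."""
    candidates = [generate(prompt) for _ in range(n_candidates)]
    scored = sorted(((reward(c), c) for c in candidates), reverse=True)
    best_score, best = scored[0]
    return best if best_score >= threshold else None


# Toy stand-ins (placeholders, not part of EVAL itself).
def toy_generate(prompt):
    return random.choice(["answer A", "answer B", "answer C"])


def toy_reward(response):
    return {"answer A": 0.2, "answer B": 0.9, "answer C": 0.6}[response]
```

For example, with three cycled candidates scored 0.2, 0.9, and 0.6, the sampler keeps the 0.9 answer; raising the threshold above every score makes it reject all candidates.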
| File | Type | License | Size | Format | |
|---|---|---|---|---|---|
| s41746-025-01589-z.pdf (open access) | Published version | Creative Commons | 1.96 MB | Adobe PDF | View/Open |