Diagnostic Accuracy of GPT-Based Large Language Models Across Versions, Prompting Techniques, and Case Presentation Formats / Fong, S.; Carollo, A.; Maso, M. D.; Martinotti, G.; Luciani, D.; Khan, Y. S.; Pellegrini, L.; Corazza, O.; Esposito, G. - In: HUMAN BEHAVIOR AND EMERGING TECHNOLOGIES. - ISSN 2578-1863. - 2026:1(2026), pp. 4674484-4674484. [10.1155/hbe2/4674484]

Diagnostic Accuracy of GPT-Based Large Language Models Across Versions, Prompting Techniques, and Case Presentation Formats

Carollo A.; Pellegrini L.
2026-01-01

Abstract

Although diagnostic assessment is central to psychiatry, agreement among mental health practitioners often ranges from poor to moderate. Large language models (LLMs), such as GPT-based models, are among the approaches that have been studied as standardized tools to support clinicians’ decision-making. The current work investigates the diagnostic accuracy of GPT-based LLMs (GPT-3.5 and GPT-5.1) across different case presentation styles (i.e., vignette and outline) and prompting techniques. A total of 46 psychiatric cases, each with an accompanying reference diagnosis, were used. Two trained clinical psychologists rated the proximity of each generated diagnosis to the reference diagnosis. A robust statistical approach was then used to investigate the effect of case format and prompt type on average diagnostic accuracy. Importantly, accuracy in this context reflects alignment with a reference label under constrained vignette-based inputs, rather than equivalence with comprehensive clinical diagnostic practice. The results showed strong agreement between the ratings of the two clinical psychologists (kappa = 0.798), with moderate agreement for GPT-3.5’s diagnoses and almost perfect agreement for GPT-5.1’s diagnoses. Overall, GPT-5.1 showed higher diagnostic accuracy and closer proximity to human diagnostic evaluations than GPT-3.5 (p < 0.001). For GPT-3.5, a small but statistically significant main effect of prompting technique on diagnostic accuracy emerged (p = 0.009). The highest proximity to the reference diagnosis was achieved when GPT-3.5 was simply instructed to provide and justify a single diagnosis for each case, compared with when it was asked to provide a diagnosis likelihood (p < 0.001) or to act as a clinical psychologist (p = 0.001). Conversely, GPT-5.1 showed high performance regardless of prompting technique and case format.
Under these experimental conditions, the results provide preliminary evidence supporting the potential use of LLMs as tools to assist the diagnostic process in psychiatry and offer general guidance for modestly improving their performance. Additionally, this study offers a methodological framework that can serve as an example for future research aiming to systematically evaluate LLMs’ diagnostic capabilities across different prompting strategies and case presentation formats.
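
As an illustration of the inter-rater agreement statistic reported above, the following is a minimal sketch of how a kappa coefficient between two raters can be computed. It assumes unweighted Cohen's kappa over categorical proximity scores; the abstract does not state which kappa variant or rating scale was used, and the function name `cohen_kappa` and the example ratings are hypothetical, not data from the study.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of categorical ratings.

    Assumption: the study's kappa may have been computed differently
    (e.g., weighted); this shows only the basic unweighted form.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)

    # Observed agreement: share of cases where both raters gave the same score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal proportions per category.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    categories = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical category; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical proximity ratings (0 = far, 1 = partial, 2 = match) for ten cases.
rater_1 = [2, 2, 1, 0, 2, 1, 1, 0, 2, 2]
rater_2 = [2, 2, 1, 0, 1, 1, 1, 0, 2, 0]
kappa = cohen_kappa(rater_1, rater_2)
print(f"kappa = {kappa:.3f}")
```

Because kappa subtracts the agreement expected by chance from the observed agreement, it is more conservative than raw percent agreement: in the hypothetical data the raters agree on 8 of 10 cases, yet kappa is noticeably lower than 0.8.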

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11368/3134739