Diagnostic Accuracy of GPT-Based Large Language Models Across Versions, Prompting Techniques, and Case Presentation Formats

Fong, S.; Carollo, A.; Maso, M. D.; Martinotti, G.; Luciani, D.; Khan, Y. S.; Pellegrini, L.; Corazza, O.; Esposito, G.

doi:10.1155/hbe2/4674484

Despite the centrality of diagnostic assessment in psychiatry, agreement among mental health practitioners often ranges from poor to moderate. Large language models (LLMs), such as GPT-based models, have been studied among other approaches as potential standardized tools to support clinicians’ decision-making. The current work investigates the diagnostic accuracy of GPT-based LLMs (gpt-3.5 and gpt-5.1) across different case presentation formats (i.e., vignette and outline) and prompting techniques. A total of 46 psychiatric cases, each with an accompanying reference diagnosis, were used. Two trained clinical psychologists rated the proximity of each generated diagnosis to the reference diagnosis. A robust statistical approach was then used to investigate the effect of case format and prompt type on the average diagnostic accuracy. Importantly, accuracy in this context reflects alignment with a reference label under constrained vignette-based inputs, rather than equivalence with comprehensive clinical diagnostic practice. Agreement between the ratings of the two clinical psychologists was strong overall (kappa = 0.798), with moderate agreement for gpt-3.5’s diagnoses and almost perfect agreement for gpt-5.1’s diagnoses. Overall, gpt-5.1 showed higher diagnostic accuracy and proximity to human diagnostic evaluations than gpt-3.5 (p < 0.001). For gpt-3.5, a small but statistically significant main effect of prompting technique on diagnostic accuracy emerged (p = 0.009). The highest proximity to the reference diagnosis was achieved when gpt-3.5 was simply instructed to provide and justify a single diagnosis for each case, compared with when it was asked to provide a diagnosis likelihood (p < 0.001) or to act as a clinical psychologist (p = 0.001). Conversely, gpt-5.1 performed well regardless of prompting technique and case format.
Under these experimental conditions, the results provide preliminary evidence supporting the potential use of LLMs as tools to assist the diagnostic process in psychiatry and offer general guidance for modestly improving their performance. Additionally, this study offers a methodological framework that can serve as an example for future research aiming to systematically evaluate LLMs’ diagnostic capabilities across different prompting strategies and case presentation formats.
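For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of how a kappa coefficient between two raters can be computed. It assumes the unweighted Cohen's kappa for categorical ratings; the abstract does not state which kappa variant was used, and the function name `cohen_kappa` and the example ratings are hypothetical, not taken from the study.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items.

    Assumption: the study may have used a different (e.g. weighted) variant.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set of items.")

    n = len(rater_a)

    # Observed agreement: share of items given the same rating by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal proportions,
    # summed over rating categories.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical category for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical ratings (1 = diagnosis matches the reference, 0 = it does not).
example_a = [1, 1, 0, 0]
example_b = [1, 0, 0, 0]
print(cohen_kappa(example_a, example_b))  # 0.5
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is lower than the simple proportion of matching ratings (0.75 in the hypothetical example).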

Diagnostic Accuracy of GPT-Based Large Language Models Across Versions, Prompting Techniques, and Case Presentation Formats / Fong, S., Carollo, A., Maso, M.D., Martinotti, G., Luciani, D., Khan, Y.S., Pellegrini, L., Corazza, O., Esposito, G. - In: HUMAN BEHAVIOR AND EMERGING TECHNOLOGIES. - ISSN 2578-1863. - 2026:1(2026), pp. 4674484-4674484. [10.1155/hbe2/4674484]