What question did this study set out to answer?

The research investigates the diagnostic accuracy of gpt-based language models in psychiatric evaluations.

April 24, 2026

Open Access

Diagnostic Accuracy of GPT‐Based Large Language Models Across Versions, Prompting Techniques, and Case Presentation Formats

Key Points

The study examines how accurately GPT-based language models (gpt-3.5 and gpt-5.1) diagnose psychiatric cases.
It analyzed 46 psychiatric cases, each with a reference diagnosis.
Two trained clinical psychologists rated how close each model-generated diagnosis was to the reference diagnosis.
A statistical analysis assessed the effects of case format (vignette or outline) and prompting technique on accuracy.
Agreement between the two clinical psychologists was strong (kappa = 0.798).
Rater agreement was moderate for gpt-3.5's diagnoses and almost perfect for gpt-5.1's; gpt-5.1 was also more accurate overall (p < 0.001).
Prompting technique had a small but statistically significant effect on gpt-3.5's accuracy (p = 0.009), while gpt-5.1 performed well regardless of prompt or case format.

Abstract

Despite the centrality of diagnostic assessment in psychiatry, agreement among mental health practitioners often ranges from poor to moderate. Large language models (LLMs, such as gpt-based models) are among the approaches that have been studied as standardized tools to support clinicians' decision-making. The current work investigates the diagnostic accuracy of gpt-based LLMs (gpt-3.5 and gpt-5.1) across different case presentation styles (i.e., vignette and outline) and prompting techniques. A total of 46 psychiatric cases, each with an accompanying diagnosis, were used. Two trained clinical psychologists evaluated the proximity of each generated diagnosis to the reference diagnosis. A robust statistical approach was then used to investigate the effect of case format and prompt type on average diagnostic accuracy. Importantly, accuracy in this context reflects alignment with a reference label under constrained vignette-based inputs, rather than equivalence with comprehensive clinical diagnostic practice. The results showed strong agreement between the ratings of the two clinical psychologists (kappa = 0.798), with moderate agreement for gpt-3.5's diagnoses and almost perfect agreement for gpt-5.1's diagnoses. Overall, gpt-5.1 showed higher diagnostic accuracy and closer proximity to human diagnostic evaluations than gpt-3.5 (p < 0.001). For gpt-3.5, a small but statistically significant main effect of prompting technique on diagnostic accuracy emerged (p = 0.009). The highest proximity to the reference diagnosis was achieved when gpt-3.5 was simply instructed to provide and justify a single diagnosis for each case, compared with when it was asked to provide a diagnosis likelihood (p < 0.001) or to act as a clinical psychologist (p = 0.001). Conversely, gpt-5.1 showed high performance independent of prompting technique and case format.
Under these experimental conditions, the results provide preliminary evidence supporting the potential use of LLMs as tools to assist the diagnostic process in psychiatry, and they offer general guidance for modestly improving model performance. Additionally, this study offers a methodological framework that can serve as an example for future research aiming to systematically evaluate LLMs' diagnostic capabilities across different prompting strategies and case presentation formats.
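
The abstract reports rater agreement as a kappa statistic without stating how it was computed. As a minimal sketch, assuming an unweighted Cohen's kappa over categorical ratings, the calculation compares observed agreement with the agreement expected by chance. The function name `cohen_kappa`, the rating labels, and the ratings themselves are hypothetical and are not taken from the study.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(ratings_a)
    # Observed agreement: share of items placed in the same category by both raters.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: sum over categories of the product of marginal frequencies.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    p_chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (p_observed - p_chance) / (1 - p_chance)


# Made-up ratings for eight cases (assumption: three proximity categories).
rater_1 = ["match", "match", "partial", "miss", "match", "partial", "match", "miss"]
rater_2 = ["match", "match", "partial", "miss", "partial", "partial", "match", "match"]
kappa = cohen_kappa(rater_1, rater_2)
print(round(kappa, 3))  # 0.6 for these made-up ratings
```

If the ratings in the study were ordinal, a weighted kappa would be the more natural choice; the unweighted form here is only the simplest case.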

Authors

Seraphina Fong

Alessandro Carollo

Martina Dal Maso

Journals

Human Behavior and Emerging Technologies

Institutions

University of Trieste

University of Trento

University of Hertfordshire
