What question did this study set out to answer?

This study evaluates three configurations of GPT-4o for generating Objective Structured Clinical Examinations in digital health.

January 18, 2026 | Open Access

AI-Driven Objective Structured Clinical Examination Generation in Digital Health Education: Comparative Analysis of Three GPT-4o Configurations

Key Points

Three GPT-4o configurations were compared for generating Objective Structured Clinical Examination (OSCE) stations in digital health: standard GPT, personalized GPT, and simulated-agents GPT.
24 OSCE stations were generated across 8 digital health topics using the three GPT-4o configurations.
Format compliance was evaluated by one expert; educational content was assessed independently by two digital health experts blinded to the configuration.
Evaluation used a comprehensive assessment grid, and statistical analyses used Kruskal-Wallis tests.
Simulated-agents GPT performed best in format compliance and on most content quality criteria.
It scored a mean of 4.47/5 for accuracy (SD 0.28; P=.01) and 4.46/5 for clarity (SD 0.52; P=.004).
It scored 88% (14/16) for usability without major revisions and ranked first in preference among the configurations.

Abstract

Background: Objective Structured Clinical Examinations (OSCEs) are used as an evaluation method in medical education, but require significant pedagogical expertise and investment, especially in emerging fields like digital health. Large language models (LLMs), such as ChatGPT (OpenAI), have shown potential in automating educational content generation. However, OSCE generation using LLMs remains underexplored.

Objective: This study aims to evaluate 3 GPT-4o configurations for generating OSCE stations in digital health: (1) standard GPT with a simple prompt and OSCE guidelines; (2) personalized GPT with a simple prompt, OSCE guidelines, and a reference book in digital health; and (3) simulated-agents GPT with a structured prompt simulating specialized OSCE agents and the digital health reference book.

Methods: Overall, 24 OSCE stations were generated across 8 digital health topics with each GPT-4o configuration. Format compliance was evaluated by one expert, while educational content was assessed independently by 2 digital health experts, blind to GPT-4o configurations, using a comprehensive assessment grid. Statistical analyses were performed using Kruskal-Wallis tests.

Results: Simulated-agents GPT performed best in format compliance and most content quality criteria, including accuracy (mean 4.47/5, SD 0.28; P=.01) and clarity (mean 4.46/5, SD 0.52; P=.004). It also had 88% (14/16) for usability without major revisions and first-place preference ranking, outperforming the other configurations. Personalized GPT showed the lowest format compliance, while standard GPT scored lowest for clarity and educational value.

Conclusions: Structured prompting strategies, particularly agents’ simulation, enhance the reliability and usability of LLM-generated OSCE content. These results support the use of artificial intelligence in medical education, while confirming the need for expert validation.
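To illustrate the kind of test named in the Methods, here is a minimal sketch of a Kruskal-Wallis comparison across three groups of ratings. It is not the authors' analysis code: the function names (`average_ranks`, `kruskal_wallis`) and the example scores are hypothetical, and the P value relies on the chi-square approximation, whose closed form `exp(-H/2)` holds only for 2 degrees of freedom (that is, exactly three groups).

```python
import math
from itertools import chain


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis(groups):
    """Return (H, p) for exactly three groups, with tie correction."""
    if len(groups) != 3:
        raise ValueError("this sketch's p-value formula assumes three groups")
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = average_ranks(pooled)

    # Sum of (rank sum squared / group size) over the groups.
    total, start = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)

    # Tie correction: divide by 1 - sum(t^3 - t) / (n^3 - n).
    counts = {}
    for value in pooled:
        counts[value] = counts.get(value, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in counts.values()) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all values are identical; H is undefined")
    h /= correction

    # Chi-square survival function with 2 degrees of freedom.
    return h, math.exp(-h / 2)


# Hypothetical 1-5 ratings for three configurations (NOT data from the study).
standard = [3.5, 4.0, 3.0, 3.5]
personalized = [4.0, 3.5, 4.0, 4.5]
simulated_agents = [4.5, 5.0, 4.5, 4.0]

h_stat, p_value = kruskal_wallis([standard, personalized, simulated_agents])
print(f"H = {h_stat:.3f}, p = {p_value:.4f}")
```

A rank-based test of this kind suits ordinal ratings such as 1 to 5 scores because it does not assume normally distributed data. In practice, a library routine such as `scipy.stats.kruskal` would usually replace the hand-written version.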

Authors

Zineb Zouakia

Emmanuel Logak

Alan Szymczak

Journals

JMIR Medical Education
