AIM: This study investigates how varying prompt conditions influence the quality and clinical coherence of responses generated by a large language model (GPT-4o mini) in simulated psychiatric Objective Structured Clinical Examination (OSCE) scenarios. METHODS: Four psychiatric OSCE cases were presented to GPT-4o mini under four conditions of increasing detail: a standard clinical prompt, a context-enhanced prompt, and two variation prompts incorporating irrelevant or distracting information. For each case, GPT-4o mini was asked to perform key OSCE tasks: history-taking, risk assessment, explanation, and management. Responses were scored using a standardised, structured rubric and analysed thematically. RESULTS: GPT-4o mini generated clinically relevant responses under the standard and context-enhanced prompts, but performance declined as irrelevant information was introduced. Quantitative scores dropped significantly across conditions, and qualitative analysis revealed reduced coherence, increased verbosity, and difficulty prioritising clinical content. CONCLUSIONS: LLMs such as GPT-4o mini can generate useful responses when given clear and concise prompt instructions. In this study, however, clinical accuracy and coherence deteriorated in the presence of distracting or ambiguous input. This highlights the need for critical evaluation and unambiguous literacy when using LLMs in medical education.
Geng et al. studied this question.
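The design described in the METHODS can be sketched as a small grid of case and condition pairings. This is only an illustrative sketch: it assumes each of the four cases was run under each of the four conditions (16 runs in total), which the abstract implies but does not state outright. The case labels, condition identifiers, and function names are hypothetical placeholders, and no scores from the study are reproduced.

```python
from itertools import product
from statistics import mean

# Hypothetical labels: the abstract names the condition types but not the cases.
CASES = ["case_1", "case_2", "case_3", "case_4"]
CONDITIONS = ["standard", "context_enhanced", "variation_1", "variation_2"]


def build_design():
    """Return every (case, condition) pairing, assuming a fully crossed design."""
    return list(product(CASES, CONDITIONS))


def mean_score_by_condition(scores):
    """Average rubric totals per condition.

    `scores` maps (case, condition) -> rubric total assigned by a rater.
    The rubric itself is not specified in the abstract, so any numeric
    total is accepted here.
    """
    return {
        cond: mean(scores[(case, cond)] for case in CASES)
        for cond in CONDITIONS
    }
```

Comparing the per-condition means returned by `mean_score_by_condition` is one simple way to see whether scores fall as distracting information is added; the statistical test used in the study is not given in the abstract.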