What question did this study set out to answer?

This study assessed the quality of ChatGPT-4o's responses to common parent questions about pediatric craniofacial surgery using five metrics: DISCERN, specificity, Flesch-Kincaid Grade Level (FKGL), emotion scoring, and PEMAT.

February 26, 2026

Evaluating ChatGPT in Pediatric Craniofacial Surgery Counseling: A Vignette-Based Assessment of Educational Quality, Specificity, Readability, and Emotional Content

Key points

Developed 12 vignettes covering cleft lip and palate, craniosynostosis, facial trauma, and otoplasty.
Evaluated responses using DISCERN, specificity ratings, FKGL for readability, emotion scoring, and PEMAT.
Two board-certified plastic surgeons rated DISCERN scores, while two medical students rated specificity and emotion on a 5-point Likert scale.
Mean DISCERN score was 43.7/75, indicating moderate quality.
Mean specificity ranged from 1.7 (craniosynostosis, the lowest) to 3.0 (otoplasty and dog bite).
The average FKGL was 9.5, above the recommended sixth- to eighth-grade level.
Mean emotion score was 3.1 on a 5-point scale, reflecting moderate emotional tone.
PEMAT scores averaged 62% for understandability but only 27% for actionability.
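
The summary does not say which tool was used to compute FKGL. As a minimal sketch, the standard Flesch-Kincaid Grade Level formula, 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59, can be applied to a response as shown here. The function names and the vowel-group syllable counter are assumptions made for illustration, and the heuristic will differ slightly from dictionary-based counters.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption, not the study's method):
    count vowel groups and discount a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def fkgl(text: str) -> float:
    """Flesch-Kincaid Grade Level for a block of English text."""
    # Sentences are split on runs of terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        0.39 * len(words) / len(sentences)
        + 11.8 * syllables / len(words)
        - 15.59
    )


if __name__ == "__main__":
    # Hypothetical sample sentence, not taken from the study's responses.
    sample = "Your child may need surgery. Recovery usually takes a few weeks."
    print(round(fkgl(sample), 1))
```

A score near 6 to 8 corresponds to the reading level recommended for patient education materials; longer sentences and more polysyllabic terms push the score higher.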

Abstract

Introduction: Large language models (LLMs) like ChatGPT have the potential to improve patient education, but their role in pediatric plastic surgery counseling remains underexplored. This study evaluated ChatGPT-4o’s responses to common parent questions across 4 pediatric craniofacial procedures using 5 metrics: DISCERN, specificity, Flesch-Kincaid Grade Level (FKGL), emotion scoring, and the Patient Education Materials Assessment Tool (PEMAT).

Methods: Twelve standardized vignettes were developed for cleft lip and palate, craniosynostosis, facial trauma from a dog bite, and otoplasty. Each case featured prompts on surgical risks, recovery, and procedure-specific concerns. All prompts were submitted on the same day using the same ChatGPT-4o profile. DISCERN scores were rated by 2 board-certified plastic surgeons. Specificity and emotion were rated on a 5-point Likert scale by 2 medical students. Readability was calculated with FKGL. PEMAT was used to assess understandability and actionability.

Results: Mean DISCERN score was 43.7/75 (reliability 23.8/40, treatment quality 20.3/35). Mean specificity ranged from 1.7 (craniosynostosis) to 3.0 (otoplasty and dog bite). Average FKGL was 9.5 (10th-grade level). Mean emotion score was 3.1. PEMAT scores averaged 62% for understandability and 27% for actionability. Facial trauma scored highest in both PEMAT domains.

Conclusions: ChatGPT-4o produced organized, accessible responses but underperformed in reliability, quality, specificity, and actionability. The reading level exceeded the recommended patient education standard of sixth to eighth grade. Emotional tone was moderate but not consistently tailored to sensitive pediatric contexts. These findings suggest ChatGPT is insufficient for unsupervised use. With refinement, LLMs may support, but not replace, physician-led counseling in pediatric craniofacial surgery.
