February 10, 2026 · Open Access

Evaluation of large Language models on pediatric asthma: a comparative study of Claude3-Opus, Gemini 2.0, ChatGPT-4o, and DeepSeek—a cross-sectional questionnaire study

Abstract

Artificial intelligence (AI) has shown potential for enhancing medical practice and improving patient outcomes. However, the efficacy and linguistic accessibility of large language models (LLMs) in pediatric asthma management remain underexplored. This study evaluated the performance of four LLMs in generating clinical information within this domain.

We administered 15 guideline-based pediatric asthma inquiries to ChatGPT-4o, Claude 3 Opus, Gemini 2.0, and DeepSeek. Anonymized responses were independently evaluated by three board-certified pediatric pulmonologists using the DISCERN instrument (score range 16–80). Readability was assessed using six standard indices. Inter-rater reliability was measured with intraclass correlation coefficients (ICC). Statistical analysis included repeated-measures and post-hoc comparisons, with effect sizes reported.

No significant difference was found in the overall quality of health information (DISCERN scores) among the four LLMs (F(3,56) = 0.144, p = .933, η² = 0.008), with all mean scores clustered within a narrow “fair-to-good” range (50.3–51.9). However, readability differed significantly: ChatGPT-4o generated significantly more comprehensible text than DeepSeek (Flesch Reading Ease [FRE] mean difference = 12.41, p = .005, Cohen’s d = 1.28), and DeepSeek performed significantly worse than all other models (all p < .05). Inter-rater reliability was high (ICC range: 0.849–0.901, all p < .001). Critically, the mean readability level of all outputs (Flesch-Kincaid Grade Level [FKGL]: 13.2–14.9) far exceeded the recommended reading level for patient materials.

While current LLMs can provide generally accurate information on pediatric asthma, their outputs exhibit significant limitations in readability for patient-facing use. ChatGPT-4o shows relative advantages in comprehensibility, yet none of the models meets recommended health-literacy standards. These findings underscore that AI should serve as a supplementary decision-support tool under clinician supervision, not as a substitute for professional medical advice. Future work should prioritize the integration of adaptive text-simplification features, validate AI-generated content in real-world clinical and caregiver settings, and expand evaluations to include emerging models and diverse chronic disease contexts.
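For readers unfamiliar with the two readability indices named in the abstract, the sketch below shows the standard published formulas for Flesch Reading Ease and Flesch-Kincaid Grade Level, together with the usual conversion from a one-way F statistic to η². It is an illustration only, not the authors' analysis pipeline: the function names and the word, sentence, and syllable counts in the example are hypothetical, and the abstract does not say which software the authors used.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximate US school grade needed."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def eta_squared_from_f(f_stat: float, df_between: int, df_within: int) -> float:
    """Eta squared for a one-way design, recovered from F and its degrees of freedom."""
    return (f_stat * df_between) / (f_stat * df_between + df_within)


if __name__ == "__main__":
    # Hypothetical counts for one response (not data from the study).
    words, sentences, syllables = 100, 5, 170
    print(round(flesch_reading_ease(words, sentences, syllables), 2))
    print(round(flesch_kincaid_grade(words, sentences, syllables), 2))

    # Values reported in the abstract: F(3,56) = 0.144.
    print(round(eta_squared_from_f(0.144, 3, 56), 3))
```

With the F statistic and degrees of freedom reported in the abstract, the last call gives a value that rounds to the reported η² of 0.008. Syllable counting is left out on purpose, since it is heuristic and differs between tools.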

Authors

Ying-Qi Hang

Jie Wu

Li Bai

Journals

BMC Medical Informatics and Decision Making

Institutions

Shanghai University of Traditional Chinese Medicine

Shaanxi University of Chinese Medicine

Shanghai Traditional Chinese Medicine Hospital
