What question did this study set out to answer?

This study aimed to evaluate the clinical safety of two large language models in oral cancer-related patient communication.

May 7, 2026 | Open Access

Evaluating the clinical safety of large language models in oral cancer-related patient communication: a repeated-prompt observational study

Key Points

This study aimed to evaluate the clinical safety of two large language models in oral cancer-related patient communication.
The repeated-prompt observational design submitted 20 standardized Turkish-language scenarios to each model daily for 7 days.
Scientific accuracy, completeness, readability, and referral safety were assessed using established scales and indices.
Internal consistency across repeated prompts was measured with Cronbach’s alpha, and inter-model agreement with intraclass correlation coefficients (ICC).
Both models showed comparable scientific accuracy (Gemini: 3.52 ± 0.57; Grok: 3.39 ± 0.68; p = 0.072) and completeness (Gemini: 3.40 ± 0.70; Grok: 3.25 ± 0.78; p = 0.091).
Referral safety was high for both models (Gemini: 90.0%; Grok: 92.1%), but Gemini failed to recommend professional consultation in two high-risk scenarios involving suspected malignancy.
Readability scores were similar, but Grok produced significantly longer sentences (p = 0.0005; Cohen’s d = 2.50), indicating greater linguistic complexity.
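
To make the readability point concrete, the following is a minimal sketch of the Ateşman index, one of the two Turkish readability measures named in the study. It assumes the commonly cited form of the formula (198.825 − 40.175 × syllables per word − 2.610 × words per sentence) and approximates syllables by counting vowels. The function names, the tokenization rules, and the sample text are illustrative assumptions, not the authors' pipeline.

```python
import re

# Turkish vowels, including circumflexed forms found in some loanwords.
TURKISH_VOWELS = set("aeıioöuüâîû")


def count_syllables(word: str) -> int:
    """Approximate the syllable count as the number of vowels in the word."""
    return sum(1 for ch in word.lower() if ch in TURKISH_VOWELS)


def atesman_score(text: str) -> float:
    """Ateşman readability score (assumed formula); higher means easier to read."""
    # Simplified sentence splitting on ., ! and ? (an assumption for this sketch).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of Unicode letters; digits and punctuation are ignored.
    words = re.findall(r"[^\W\d_]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables_per_word = sum(count_syllables(w) for w in words) / len(words)
    words_per_sentence = len(words) / len(sentences)
    return 198.825 - 40.175 * syllables_per_word - 2.610 * words_per_sentence


# Hypothetical patient-style text, written as two short sentences and as one long one.
short_sentences = "Ağzımda yara var. Doktora gitmeli miyim?"
one_long_sentence = "Ağzımda yara var doktora gitmeli miyim?"

print(round(atesman_score(short_sentences), 2))
print(round(atesman_score(one_long_sentence), 2))
```

Because the sentence-length term enters with a negative coefficient, the same words packed into fewer, longer sentences score lower under this formula.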

Abstract

As patients increasingly consult large language models (LLMs) for health-related information, evaluating the clinical safety of AI-generated responses has become essential, particularly in high-risk domains such as oral oncology. Despite growing interest in AI applications in medicine, evidence regarding response consistency and referral safety in patient communication remains limited. This study aimed to assess the clinical safety of two contemporary LLMs in oral cancer–related patient scenarios using a multidimensional evaluation framework.

This repeated-prompt observational study evaluated Google Gemini (Pro version) and xAI Grok (Grok-1) over a 7-day period. Twenty standardized Turkish-language patient scenarios related to suspected oral cancer were submitted daily to each model, generating 280 responses. Scientific accuracy and completeness were assessed using a 5-point Likert scale by two independent oral and maxillofacial radiologists. Readability was evaluated using validated Turkish indices (Ateşman and Bezirci–Yılmaz). Referral safety was assessed as a binary outcome. Internal consistency across repeated prompts was measured using Cronbach’s alpha, and inter-model agreement was analyzed using intraclass correlation coefficients (ICC).

Both models demonstrated comparable levels of scientific accuracy (Gemini: 3.52 ± 0.57; Grok: 3.39 ± 0.68; p = 0.072) and completeness (3.40 ± 0.70 vs. 3.25 ± 0.78; p = 0.091). Overall referral safety was high (Gemini: 90.0%; Grok: 92.1%), although the Gemini model failed to recommend professional consultation in two high-risk scenarios involving suspected malignancy. In contrast, Grok consistently recommended referral across all scenarios. Readability scores were similar between models; however, Grok generated significantly longer sentences (p = 0.0005; Cohen’s d = 2.50), indicating increased linguistic complexity. Internal consistency was high for both models (Gemini α = 0.942; Grok α = 0.886), whereas inter-model agreement was moderate (ICC: 0.50–0.58).

Contemporary LLMs demonstrate generally acceptable accuracy and a precautionary approach in oral cancer–related communication. However, variability in referral behavior and linguistic structure, along with occasional under-referral in high-risk scenarios, highlights potential clinical risks. While these systems may support patient education and triage, they should be considered adjunctive tools and not substitutes for professional evaluation. Further research is needed to optimize the balance between clinical caution and appropriate guidance in AI-assisted healthcare communication.
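
As an illustration of the internal-consistency measure reported above, the sketch below computes Cronbach's alpha with the standard formula α = k/(k − 1) × (1 − Σ item variances / variance of row totals). The data layout is an assumption made for this example: each row is one scenario and each column is one repeated submission day. The function name and the toy ratings are hypothetical and are not the study's data.

```python
from statistics import variance


def cronbach_alpha(rows: list[list[float]]) -> float:
    """Cronbach's alpha for a table of ratings.

    rows: one list per scenario (assumed layout), one rating per repeated day.
    """
    n_items = len(rows[0])
    if n_items < 2 or len(rows) < 2:
        raise ValueError("need at least two rows and two columns")
    if any(len(r) != n_items for r in rows):
        raise ValueError("all rows must have the same number of ratings")
    # Sample variance of each column (each repeated day).
    item_variances = [variance(col) for col in zip(*rows)]
    # Sample variance of the per-scenario total scores.
    total_variance = variance([sum(r) for r in rows])
    if total_variance == 0:
        raise ValueError("total scores have zero variance; alpha is undefined")
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Toy 5-point Likert ratings: 4 scenarios rated on 3 days (made-up numbers).
toy_ratings = [
    [3, 4, 3],
    [4, 4, 5],
    [2, 3, 2],
    [5, 5, 4],
]
alpha = cronbach_alpha(toy_ratings)
print(round(alpha, 3))
```

Values close to 1 indicate that scores move together across repetitions, which is the sense in which the reported alphas of 0.942 and 0.886 are described as high.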
