As patients increasingly consult large language models (LLMs) for health-related information, evaluating the clinical safety of AI-generated responses has become essential, particularly in high-risk domains such as oral oncology. Despite growing interest in AI applications in medicine, evidence regarding response consistency and referral safety in patient communication remains limited. This study aimed to assess the clinical safety of two contemporary LLMs in oral cancer–related patient scenarios using a multidimensional evaluation framework.

This repeated-prompt observational study evaluated Google Gemini (Pro version) and xAI Grok (Grok-1) over a 7-day period. Twenty standardized Turkish-language patient scenarios related to suspected oral cancer were submitted daily to each model, generating 280 responses. Scientific accuracy and completeness were assessed using a 5-point Likert scale by two independent oral and maxillofacial radiologists. Readability was evaluated using validated Turkish indices (Ateşman and Bezirci–Yılmaz). Referral safety was assessed as a binary outcome. Internal consistency across repeated prompts was measured using Cronbach’s alpha, and inter-model agreement was analyzed using intraclass correlation coefficients (ICC).

Both models demonstrated comparable levels of scientific accuracy (Gemini: 3.52 ± 0.57; Grok: 3.39 ± 0.68; p = 0.072) and completeness (3.40 ± 0.70 vs. 3.25 ± 0.78; p = 0.091). Overall referral safety was high (Gemini: 90.0%; Grok: 92.1%), although the Gemini model failed to recommend professional consultation in two high-risk scenarios involving suspected malignancy. In contrast, Grok consistently recommended referral across all scenarios. Readability scores were similar between models; however, Grok generated significantly longer sentences (p = 0.0005; Cohen’s d = 2.50), indicating increased linguistic complexity. Internal consistency was high for both models (Gemini α = 0.942; Grok α = 0.886), whereas inter-model agreement was moderate (ICC: 0.50–0.58).

Contemporary LLMs demonstrate generally acceptable accuracy and a precautionary approach in oral cancer–related communication. However, variability in referral behavior and linguistic structure, along with occasional under-referral in high-risk scenarios, highlights potential clinical risks. While these systems may support patient education and triage, they should be considered adjunctive tools and not substitutes for professional evaluation. Further research is needed to optimize the balance between clinical caution and appropriate guidance in AI-assisted healthcare communication.
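As a rough illustration of the internal-consistency measure named above, the sketch below computes Cronbach's alpha for one model, treating each of the seven daily runs as an "item" and each of the twenty scenarios as a "subject". The function name, the data layout, and the example ratings are assumptions made for illustration; they are not the authors' analysis code or data, and the abstract does not state which software was used.

```python
from statistics import variance

# Study design as stated in the abstract: 20 scenarios x 7 days x 2 models.
SCENARIOS, DAYS, MODELS = 20, 7, 2
TOTAL_RESPONSES = SCENARIOS * DAYS * MODELS  # 280 responses in total


def cronbach_alpha(scores):
    """Cronbach's alpha for a scenarios-by-runs rating table.

    scores: list of rows, one row per scenario; each row holds one
    rating per repeated run (hypothetical layout, assumed here).
    """
    n_runs = len(scores[0])
    if len(scores) < 2 or n_runs < 2:
        raise ValueError("need at least two scenarios and two runs")
    # Sample variance of each run (column) across scenarios.
    run_vars = [variance([row[j] for row in scores]) for j in range(n_runs)]
    # Sample variance of the per-scenario total score.
    total_var = variance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores have zero variance; alpha is undefined")
    return (n_runs / (n_runs - 1)) * (1 - sum(run_vars) / total_var)


if __name__ == "__main__":
    # Made-up Likert ratings for three scenarios over two runs.
    example = [[4, 3], [3, 4], [5, 5]]
    print(f"alpha = {cronbach_alpha(example):.3f}")
```

Alpha approaches 1 when a model rates each scenario the same way on every run, and falls as day-to-day variation within scenarios grows relative to the variation between scenarios.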
Kollayan et al. studied this question.