Over the past few years, large language models (LLMs) have seen rapidly growing use and attention worldwide, forming the backbone of the deep learning systems behind chatbots and artificial intelligence (AI) search engines. LLMs are often fine-tuned with human-preference-based methods (e.g., Reinforcement Learning from Human Feedback¹), which reward the model when its responses align with user expectations. In their article, Perry et al. draw attention to an underexplored and potentially dangerous aspect of LLM performance in medicine: sycophantic behavior. Whereas prior studies have primarily focused on accuracy and efficiency², this investigation examines how LLMs respond when user input is misleading, ambiguous, or factually incorrect, using orthopaedic scenarios as the testing ground.
The accuracy of LLMs was assessed in 3 contexts: (1) benchmark answering of Orthopaedic In-Training Examination-like questions, (2) agreement or disagreement with controversial user beliefs, and (3) detection of inaccuracies within abstracts of highly cited orthopaedic literature. In all 3 contexts, the inclusion of hints, prompts, and user beliefs drastically reduced the accuracy of the tested LLMs. In the third context (detection of inaccuracies), a further nuance was which errors the LLMs failed to catch: the models reliably corrected quantitative information (i.e., statistical inaccuracies) essential to the main findings of the article, but they did not challenge peripheral details (i.e., wrongful attribution of authorship). A basic degree of AI-related literacy will increasingly be required of the practicing clinician, with regard to both how to harness AI for task assistance and how to avoid the pitfalls intrinsic to the technology.
When using LLMs, one needs to consider how each question is phrased; framing in the form of hints, opinions, or misleading metadata should be avoided, as it may inadvertently reinforce errors and foster unwarranted confidence³. Patients, too, are increasingly searching for information on their symptoms, disease, and management prior to consultation, an endeavor that often results in undue anxiety when their worst fears are reinforced without any assessment of the accuracy of the information or of its context (i.e., relevance, likelihood). Understanding how patients phrase their AI queries may help with counseling and with reducing misinformation.
Several limitations warrant consideration. Only 2 models (GPT-4o [OpenAI] and Gemini 2.5 Flash-Lite [Google DeepMind]) were evaluated, and the results may not be generalizable to other architectures or future versions. The prompts were constructed under controlled conditions, with instructions for binary classification of correctness or agreement, which may have restricted more subtle expressions of uncertainty⁴. However, the low frequency of genuinely noncommittal responses (12%) suggests that models often default to a position rather than explicitly acknowledging ambiguity.
The performance of LLMs will undoubtedly improve through training strategies that prioritize factual independence over perceived agreeableness⁵. At present, however, Perry et al. provide compelling evidence that LLMs are not reliably resilient to misleading user input or ambiguous framing. As these systems are integrated into orthopaedic workflows and clinician-patient interactions, awareness of, and safeguards against, sycophantic tendencies will be essential to ensure their safe and effective use. In the era of LLMs, there remains a role for the evidence-based, well-informed, and objective clinician who is not fixated on telling patients what they want to hear.
Shea et al. (Thu,) studied this question.