What question did this study set out to answer?

This study evaluates the accuracy and reliability of various AI models in providing patient education for breast reconstruction.

May 8, 2026

Comparative Evaluation of Artificial Intelligence Large Language Models for Breast Reconstruction Patient Education

Key Points

The authors developed 10 standardized questions covering reconstruction options, complications, recovery, and insurance coverage.
Evaluated five AI systems: ChatGPT o3-high, ChatGPT 4.5, Grok 3, Claude Haiku 3.5, and MicroRAG.
Responses were rated by 4 plastic surgeons using a Global Quality Score from 1 to 5.
ChatGPT o3-high achieved the highest mean score of 3.73, significantly outperforming ChatGPT 4.5 (P=0.005).
MicroRAG excelled in recovery topics, achieving perfect scores (5.0) and providing evidence-based responses.
Performance varied across models and question types, with each system showing distinct strengths.

Abstract

Introduction: Patient education is crucial for informed decision-making in breast reconstruction surgery. Large language models (LLMs) have emerged as potential tools for providing medical information, but their comparative accuracy and reliability for specialized surgical topics remain unclear. This study aims to evaluate the performance of multiple artificial intelligence (AI) models, including general-purpose LLMs and a specialized retrieval-augmented generation (RAG) system, in providing breast reconstruction patient education.

Methods: We developed 10 standardized breast reconstruction questions covering reconstruction options, complications, recovery, and insurance coverage. Five AI systems were evaluated: ChatGPT o3-high, ChatGPT 4.5, Grok 3, Claude Haiku 3.5, and our specialized MicroRAG system trained on 4876 microsurgical publications. Responses were assessed using the Global Quality Score (1-5 scale) by 4 plastic surgeons, measuring accuracy, relevance, clarity, and completeness.

Results: Performance varied across models and question types, with each system demonstrating distinct strengths. ChatGPT o3-high achieved the highest overall mean score (3.73), followed by Grok 3 (3.55), Claude Haiku 3.5 (3.52), MicroRAG (3.42), and ChatGPT 4.5 (3.30). MicroRAG excelled in evidence-based clinical recovery topics, achieving perfect scores (5.0) for specialized areas and providing literature-cited responses. Statistical analysis revealed that ChatGPT o3-high significantly outperformed ChatGPT 4.5 (P = .005), while differences between other model pairs were not statistically significant.

Conclusions: Different AI systems demonstrated complementary strengths for breast reconstruction patient education. While general-purpose LLMs like ChatGPT o3-high provided consistent performance across diverse patient information needs, specialized RAG systems like MicroRAG offered superior evidence-based responses in specific clinical domains. These findings indicate that healthcare providers should consider complementary system strengths and domain-specific requirements when selecting AI tools for patient education.

Cite This Study

Ozmen et al. studied this question.

synapsesocial.com/papers/69fd7e5cbfa21ec5bbf068f8 https://doi.org/10.1177/22925503261446323
