Introduction: Patient education is crucial for informed decision-making in breast reconstruction surgery. Large language models (LLMs) have emerged as potential tools for providing medical information, but their comparative accuracy and reliability for specialized surgical topics remain unclear. This study aims to evaluate the performance of multiple artificial intelligence (AI) models, including general-purpose LLMs and a specialized retrieval-augmented generation (RAG) system, in providing breast reconstruction patient education.

Methods: We developed 10 standardized breast reconstruction questions covering reconstruction options, complications, recovery, and insurance coverage. Five AI systems were evaluated: ChatGPT o3-high, ChatGPT 4.5, Grok 3, Claude Haiku 3.5, and our specialized MicroRAG system trained on 4876 microsurgical publications. Responses were assessed using the Global Quality Score (1-5 scale) by 4 plastic surgeons, measuring accuracy, relevance, clarity, and completeness.

Results: Performance varied across models and question types, with each system demonstrating distinct strengths. ChatGPT o3-high achieved the highest overall mean score (3.73), followed by Grok 3 (3.55), Claude Haiku 3.5 (3.52), MicroRAG (3.42), and ChatGPT 4.5 (3.30). MicroRAG excelled in evidence-based clinical recovery topics, achieving perfect scores (5.0) for specialized areas and providing literature-cited responses. Statistical analysis revealed that ChatGPT o3-high significantly outperformed ChatGPT 4.5 (P = .005), while differences between other model pairs were not statistically significant.

Conclusions: Different AI systems demonstrated complementary strengths for breast reconstruction patient education. While general-purpose LLMs like ChatGPT o3-high provided consistent performance across diverse patient information needs, specialized RAG systems like MicroRAG offered superior evidence-based responses in specific clinical domains. These findings indicate that healthcare providers should consider complementary system strengths and domain-specific requirements when selecting AI tools for patient education.
Ozmen et al. studied this question.
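The overall mean scores in the Results come from pooling rater scores across questions for each system. The abstract does not describe the exact aggregation or the statistical test used, so the following is only a minimal sketch of one plausible approach: a plain average of every rater's Global Quality Score over all questions. The model names, the `ratings` structure, and all numbers are hypothetical placeholders, not data from the study.

```python
from statistics import mean

# Hypothetical placeholder ratings, NOT data from the study.
# ratings[model] is a list of questions; each question holds one
# Global Quality Score (1-5) per rater.
ratings = {
    "model_a": [[4, 5, 4, 4], [3, 4, 3, 3]],
    "model_b": [[3, 3, 4, 3], [4, 4, 4, 3]],
}


def overall_mean(per_question):
    """Average every rater score across all questions for one model."""
    scores = [score for question in per_question for score in question]
    if any(not 1 <= score <= 5 for score in scores):
        raise ValueError("Global Quality Scores must lie between 1 and 5")
    return mean(scores)


def rank_models(all_ratings):
    """Return (model, overall mean) pairs sorted from highest to lowest mean."""
    means = [(model, overall_mean(qs)) for model, qs in all_ratings.items()]
    return sorted(means, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for model, score in rank_models(ratings):
        print(f"{model}: {score:.2f}")
```

Pooling all scores weights every rater and question equally. Averaging per question first would give the same result only when each question has the same number of raters, as in the 4-rater design described in the Methods.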