What does this research mean for the field?

A genetic neuro-symbolic large language model system achieves superior accuracy in cholangitis management decisions compared to conventional AI models and human medical experts. Novelty: methodological. Consensus alignment: neutral.

What question did this study set out to answer?

This research aims to evaluate the effectiveness of a neuro-symbolic large language model in cholangitis management compared to other AI models and human experts.

June 3, 2026 · Open Access

Performance comparison of a neuro-symbolic large language model system versus conventional AI models and human experts in cholangitis management

Key Points

Multi-center cross-sectional study involving 30 case-based questions from ABIM gastroenterology exams.
Compared performance of a genetic neuro-symbolic LLM system against Claude 4.5 Sonnet, ChatGPT 5.2, Gemini 2.0 Flash, and human experts.
Participants included 10 gastroenterology specialists and 4 emergency medicine physicians from four tertiary centers in Turkey.
The genetic neuro-symbolic LLM system achieved 100% accuracy (30/30), outperforming Claude 4.5 Sonnet (90.0%) and ChatGPT 5.2 (60.0%).
Gastroenterology specialists had a mean accuracy of 95.7% ± 3.2% versus 84.2% ± 8.8% for emergency physicians, and outperformed them in treatment decisions (p = 0.012).
The neuro-symbolic system offered superior performance across the diagnosis, treatment, and complications categories; gastroenterologists showed non-inferior performance to Gemini 2.0 Flash overall (p = 0.034).
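
The accuracy figures above are simple proportions of the 30 questions. The following is a minimal sketch of that arithmetic. Only the 30/30 count is stated in the source; the other counts are back-calculated here from the reported percentages, and the helper names `accuracy_pct` and `implied_correct` are hypothetical.

```python
N_QUESTIONS = 30  # total case-based questions reported in the study


def accuracy_pct(correct: int, total: int = N_QUESTIONS) -> float:
    """Percentage of questions answered correctly, rounded to one decimal."""
    return round(100 * correct / total, 1)


def implied_correct(pct: float, total: int = N_QUESTIONS) -> int:
    """Nearest whole number of correct answers consistent with a reported percentage."""
    return round(pct * total / 100)


# Overall accuracies as reported in the abstract.
reported = {
    "Genetic neuro-symbolic LLM system": 100.0,
    "Claude 4.5 Sonnet": 90.0,
    "Gemini 2.0 Flash": 63.3,
    "ChatGPT 5.2": 60.0,
}

for name, pct in reported.items():
    # Assumption: each percentage corresponds to a whole number of the 30 questions.
    k = implied_correct(pct)
    print(f"{name}: {k}/{N_QUESTIONS} = {accuracy_pct(k)}%")
```

With only 30 questions, each item is worth about 3.3 percentage points, so the gap between 60.0% and 63.3% is a single question.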

Abstract

Large language models (LLMs) have demonstrated promising outcomes in medical decision support; however, their efficacy in managing complex hepatobiliary conditions remains insufficiently examined. We developed a genetic neuro-symbolic LLM system that integrates multiple AI agents with neural-symbolic reasoning for the management of cholangitis and compared its performance with that of conventional LLMs and human experts. This multi-center cross-sectional study included 30 case-based questions from American Board of Internal Medicine (ABIM) gastroenterology subspecialty examinations covering acute cholangitis. Questions were categorized into diagnosis (n = 10), treatment (n = 10), and complications/prognosis (n = 10). The performance of a genetic neuro-symbolic LLM system orchestrated via LangGraph was compared against Claude 4.5 Sonnet, ChatGPT 5.2, Gemini 2.0 Flash, 10 gastroenterology specialists, and 4 emergency medicine physicians from four tertiary centers in Turkey. The genetic neuro-symbolic system achieved the highest overall accuracy (100%, 30/30), significantly outperforming Claude 4.5 Sonnet (90.0%), ChatGPT 5.2 (60.0%), Gemini 2.0 Flash (63.3%), gastroenterology experts (mean 95.7% ± 3.2%), and emergency medicine physicians (mean 84.2% ± 8.8%). The neuro-symbolic system demonstrated superior performance across all categories and cholangitis subtypes. Among human participants, gastroenterologists outperformed emergency physicians in treatment decisions (p = 0.012) and showed non-inferior performance to Gemini 2.0 Flash overall (p = 0.034).
The genetic neuro-symbolic LLM system demonstrated superior accuracy in cholangitis management compared to all conventional AI models and human experts. This proof-of-concept study suggests that multi-agent architectures with neural-symbolic reasoning may offer a promising direction for AI-assisted clinical decision support in complex hepatobiliary conditions, although prospective clinical validation is required before broader implementation claims can be warranted.
