I present ARIA, a multi-agent deliberation system that achieves 92.4% accuracy on the full 198-question GPQA Diamond benchmark, exceeding every constituent model by +4.0 percentage points, through structured adversarial deliberation across six heterogeneous large language model architectures. The system runs a three-round protocol (reconnaissance, analysis, synthesis) in which independent analysts powered by different model families (Gemini, Grok, Qwen, Claude, Mistral, GPT) debate before a chairman synthesizes a final verdict by weighing argument quality rather than counting votes. I introduce a three-loop learning architecture: persona-level reinforcement learning that adjusts agent influence based on rolling accuracy, an outcome-linked memory lifecycle that scores and prunes agent knowledge through nightly consolidation, and human-curated skill injection that seeds domain expertise into agent prompts for organic absorption and RL validation. On GPQA Diamond, the board recovers correct answers on 21 questions where most or all individual models fail (including 3 where zero models answer correctly), showing that synthesis adds genuine reasoning value beyond majority voting (+5.1 percentage points). In live financial decision-making over 147 board meetings (376 scored verdicts), the system produces monotonically calibrated conviction scores (73.5% accuracy at high conviction vs 35.7% at low conviction) and maintains deliberation diversity through empirical dissent weighting. I argue that structured multi-agent deliberation across architecturally diverse models, combined with outcome-linked learning loops, is a general reasoning amplifier rather than a domain-specific tool.
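To make the persona-level loop concrete, here is a minimal Python sketch of how an agent's influence could track its rolling accuracy, and how a weighted tally can diverge from a plain majority vote. Everything in it is an illustrative assumption: the class `RollingInfluence`, the functions `majority_vote` and `weighted_vote`, the window size, and the neutral prior are hypothetical names and values, not taken from ARIA. The weighted tally is also only a stand-in; the chairman described above synthesizes a verdict from argument quality, which this sketch does not attempt to model.

```python
from collections import Counter, deque


class RollingInfluence:
    """Hypothetical tracker: influence weight = accuracy over a rolling window."""

    def __init__(self, window=20, prior=0.5):
        self.window = window  # assumed window size, not from the paper
        self.prior = prior    # assumed neutral weight for agents with no history
        self.history = {}     # agent name -> deque of 1/0 outcomes

    def record(self, agent, correct):
        """Store one scored outcome; the deque drops the oldest entry when full."""
        outcomes = self.history.setdefault(agent, deque(maxlen=self.window))
        outcomes.append(1 if correct else 0)

    def weight(self, agent):
        """Return rolling accuracy, or the prior if the agent has no history."""
        outcomes = self.history.get(agent)
        if not outcomes:
            return self.prior
        return sum(outcomes) / len(outcomes)


def majority_vote(answers):
    """Baseline: the most common answer, one vote per agent."""
    return Counter(answers.values()).most_common(1)[0][0]


def weighted_vote(answers, influence):
    """Sum each agent's influence weight behind the answer it gave."""
    totals = {}
    for agent, answer in answers.items():
        totals[answer] = totals.get(answer, 0.0) + influence.weight(agent)
    return max(totals, key=totals.get)
```

With one analyst that has been right on its last four scored outcomes and two that were right on one of four, a 2-to-1 vote for the weaker pair's answer wins under `majority_vote` but loses under `weighted_vote`, since the summed weights are 1.0 against 0.5.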
Philip Breisner
Philip Breisner (Fri,) studied this question.
www.synapsesocial.com/papers/69b606af83145bc643d1cd0c — DOI: https://doi.org/10.5281/zenodo.18997887