March 15, 2026 · Open Access

Disagreement Is All You Need: Adversarial Multi-Agent Deliberation with Three-Loop Reinforcement Learning

Key Points

Aims to develop a deliberation system that improves decision-making accuracy through multi-agent interaction and reinforcement learning.
Developed ARIA, a multi-agent deliberation system built on six large language models from different model families.
Implemented a three-round deliberation protocol consisting of reconnaissance, analysis, and synthesis phases.
Used a three-loop learning architecture that combines persona-level reinforcement learning, an outcome-linked memory lifecycle, and human-curated skill injection.
Achieved 92.4% accuracy on the GPQA Diamond benchmark, exceeding every constituent model by at least 4.0 percentage points.
Recovered correct answers on 21 questions where most or all individual models failed, showing that synthesis adds reasoning value beyond majority voting.
Produced calibrated conviction scores in live financial decision-making, with 73.5% accuracy at high conviction versus 35.7% at low conviction.
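The calibration result means that verdicts issued with high conviction should be right more often than those issued with low conviction. A minimal sketch of how such a check could be computed, assuming each scored verdict is a (conviction, correct) pair with conviction normalized to [0, 1]. The bucket boundaries, the helper names `accuracy_by_conviction` and `is_monotonic`, and the toy verdicts are hypothetical; they are not the study's code or data.

```python
from bisect import bisect_right

# Hypothetical lower bounds of the conviction buckets: low, medium, high.
BUCKET_LOWER_BOUNDS = (0.0, 0.4, 0.7)

def accuracy_by_conviction(verdicts, lower_bounds=BUCKET_LOWER_BOUNDS):
    """Return accuracy per conviction bucket (None for an empty bucket).

    verdicts: iterable of (conviction, correct) pairs, conviction assumed in [0, 1].
    """
    hits = [0] * len(lower_bounds)
    totals = [0] * len(lower_bounds)
    for conviction, correct in verdicts:
        # Index of the last lower bound that is <= conviction.
        bucket = bisect_right(lower_bounds, conviction) - 1
        if bucket < 0:
            raise ValueError(f"conviction {conviction} is below the lowest bucket")
        totals[bucket] += 1
        hits[bucket] += int(correct)
    return [h / t if t else None for h, t in zip(hits, totals)]

def is_monotonic(accuracies):
    """True if accuracy never drops as conviction rises; empty buckets are skipped."""
    observed = [a for a in accuracies if a is not None]
    return all(lo <= hi for lo, hi in zip(observed, observed[1:]))

# Toy verdicts for illustration only; these are not results from the study.
toy_verdicts = [
    (0.20, False), (0.30, True), (0.35, False),  # low conviction
    (0.50, True), (0.60, False), (0.65, True),   # medium conviction
    (0.80, True), (0.90, True), (0.95, True),    # high conviction
]
accuracies = accuracy_by_conviction(toy_verdicts)
print([round(a, 3) for a in accuracies])  # [0.333, 0.667, 1.0]
print(is_monotonic(accuracies))           # True
```

Per-bucket accuracy keeps the check in the same form as the reported high/low conviction figures. With real verdicts, bucket sizes are worth reporting alongside, since a small bucket makes the monotonicity check noisy.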

Abstract

I present ARIA, a multi-agent deliberation system that achieves 92.4% accuracy on the full 198-question GPQA Diamond benchmark, exceeding every constituent model by at least 4.0 percentage points, through structured adversarial deliberation across six heterogeneous large language model architectures. The system runs a three-round protocol (reconnaissance, analysis, synthesis) in which independent analysts powered by different model families (Gemini, Grok, Qwen, Claude, Mistral, GPT) debate before a chairman synthesizes a final verdict by weighing argument quality rather than counting votes. I introduce a three-loop learning architecture: persona-level reinforcement learning that adjusts agent influence based on rolling accuracy; an outcome-linked memory lifecycle that scores and prunes agent knowledge through nightly consolidation; and human-curated skill injection that seeds domain expertise into agent prompts, where it is absorbed organically and validated by RL. On GPQA Diamond, the board recovers correct answers on 21 questions where most or all individual models fail (including 3 where no model answers correctly), showing that synthesis adds genuine reasoning value beyond majority voting (+5.1 pp). In live financial decision-making over 147 board meetings (376 scored verdicts), the system produces monotonically calibrated conviction scores (73.5% accuracy at high conviction vs. 35.7% at low conviction) and maintains deliberation diversity through empirical dissent weighting. I argue that structured multi-agent deliberation across architecturally diverse models, combined with outcome-linked learning loops, is a general reasoning amplifier rather than a domain-specific tool.
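The abstract states that persona-level reinforcement learning adjusts each analyst's influence based on rolling accuracy, but it does not give the update rule. The sketch below assumes one simple rule: an analyst's weight is its accuracy over its last `window` scored verdicts, clipped below by a `floor`, with a `prior` of 0.5 used before any outcome arrives. The class and function names, the window, the floor, and the prior are all hypothetical. Because ARIA's chairman weighs argument quality rather than counting votes, the weighted tally here is only a stand-in showing how learned influence can overturn a plain majority vote; it does not model the chairman, the memory lifecycle, or skill injection.

```python
from collections import Counter, deque

class PersonaWeights:
    """Hypothetical influence weights derived from each analyst's rolling accuracy."""

    def __init__(self, analysts, window=20, floor=0.1, prior=0.5):
        # Only the most recent `window` outcomes per analyst are kept.
        self.history = {name: deque(maxlen=window) for name in analysts}
        self.floor = floor  # minimum weight, so a weak analyst still counts a little
        self.prior = prior  # weight used before any outcome has been scored

    def record(self, analyst, correct):
        """Log whether the analyst's answer in a scored verdict was correct."""
        self.history[analyst].append(1.0 if correct else 0.0)

    def weight(self, analyst):
        outcomes = self.history[analyst]
        rolling = sum(outcomes) / len(outcomes) if outcomes else self.prior
        return max(self.floor, rolling)

def weighted_vote(answers, weights):
    """Pick the answer whose supporters carry the most total influence."""
    totals = Counter()
    for analyst, answer in answers.items():
        totals[answer] += weights.weight(analyst)
    return totals.most_common(1)[0][0]

def majority_vote(answers):
    """Pick the answer given by the most analysts (unweighted baseline)."""
    return Counter(answers.values()).most_common(1)[0][0]

# Toy setup: labels follow the six model families; outcome histories are made up.
analysts = ["gemini", "grok", "qwen", "claude", "mistral", "gpt"]
pw = PersonaWeights(analysts, window=5)
for name in analysts:
    n_correct = 5 if name in ("claude", "gpt") else 1
    for i in range(5):
        pw.record(name, correct=i < n_correct)

answers = {"gemini": "A", "grok": "A", "qwen": "A",
           "claude": "B", "gpt": "B", "mistral": "C"}
print(majority_vote(answers))      # A (three analysts vs two)
print(weighted_vote(answers, pw))  # B (2.0 total influence vs about 0.6)
```

The floor is a design choice made for this sketch: without it, an analyst on a losing streak would contribute nothing to the tally until its recent accuracy recovers. Whether ARIA uses anything similar is not stated in the abstract.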

Authors

Philip Breisner
