April 15, 2026 | Open Access

A Multi-Agent LLM Experiment Revealing Architect-Level Failure Modes in Scientific Software Development

Key Points

Explored failure modes in multi-agent LLM workflows for scientific software development.
Conducted a controlled experiment with Claude as architect (prompt designer) and Codex as coder (Ruby implementer).
Implemented exact analytical solutions to five Tier 2 applied quantum mechanics problems in pure Ruby.
Validated computed results against peer-reviewed literature values from Griffiths and Schroeter (2018).
Identified repeated errors by Claude as architect, including hallucinated goals and context loss.
Codex implemented each prompt as specified, despite the architect's errors.
Documented Claude's errors in five groups, ordered by severity.

Abstract

This paper presents quantumbench, a controlled experiment in multi-agent LLM-assisted scientific software development. The system implements exact analytical solutions to five Tier 2 applied quantum mechanics problems in pure Ruby and validates computed results against peer-reviewed literature values from Griffiths and Schroeter (2018). The experiment employs a two-agent architecture: Claude as architect (prompt designer) and Codex as coder (Ruby implementer). The primary finding is not about quantum mechanics. It is about the multi-agent workflow itself: Claude, acting as architect, repeatedly hallucinated experiment goals that were never stated, substituted its own interpretations despite explicit correction, and directed Codex down architecturally wrong paths. Codex performed correctly throughout, implementing what each prompt specified. The reliability bottleneck in this multi-agent system was the LLM as architect, not the LLM as coder. Claude's errors are documented in five groups ordered by severity (goal substitution, incomplete refactors, context loss, prompt design gaps, and process violations), alongside one documented Codex implementation error and one mixed-attribution historical error note in CODEXERRORS.md. All five quantum mechanics problems ultimately pass validation against Griffiths and Schroeter values.
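As a rough illustration of what validating a computed result against a literature value can look like in pure Ruby, the sketch below compares an analytically computed energy with a textbook reference using a relative tolerance. Everything in it is an assumption made for illustration: the abstract does not name the five problems, so the hydrogen ground state is only a stand-in, and the helper names (`hydrogen_ground_state_ev`, `within_tolerance?`) and the tolerance are hypothetical rather than taken from quantumbench.

```ruby
# Illustrative sketch only. The problem choice, method names, and tolerance
# are assumptions; none of them come from the quantumbench source.

# Physical constants in SI units (CODATA 2018 values).
HBAR = 1.054571817e-34   # reduced Planck constant, J s
M_E  = 9.1093837015e-31  # electron mass, kg
E_CH = 1.602176634e-19   # elementary charge, C
EPS0 = 8.8541878128e-12  # vacuum permittivity, F/m

# Ground-state energy of hydrogen in electron volts:
#   E_1 = -m e^4 / (2 (4 pi eps0)^2 hbar^2)
def hydrogen_ground_state_ev
  joules = -M_E * E_CH**4 / (2.0 * (4.0 * Math::PI * EPS0)**2 * HBAR**2)
  joules / E_CH
end

# True when computed agrees with reference to within a relative tolerance.
def within_tolerance?(computed, reference, rel_tol)
  (computed - reference).abs <= rel_tol * reference.abs
end

computed  = hydrogen_ground_state_ev
reference = -13.6 # eV, the familiar textbook value for hydrogen

status = within_tolerance?(computed, reference, 1e-3) ? "PASS" : "FAIL"
puts "hydrogen ground state: #{status}"
```

A relative tolerance is used here because reference values quoted in textbooks are usually rounded to a few significant figures, so an exact floating-point comparison would fail even for a correct implementation.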

Authors

T. Bass
