This paper presents quantumbench, a controlled experiment in multi-agent, LLM-assisted scientific software development. The system implements exact analytical solutions to five Tier 2 applied quantum mechanics problems in pure Ruby and validates the computed results against peer-reviewed literature values from Griffiths and Schroeter (2018). The experiment uses a two-agent architecture: Claude as architect (prompt designer) and Codex as coder (Ruby implementer). The primary finding is not about quantum mechanics but about the multi-agent workflow itself: Claude, acting as architect, repeatedly hallucinated experiment goals that were never stated, substituted its own interpretations despite explicit correction, and directed Codex down architecturally wrong paths. Codex performed correctly throughout, implementing what each prompt specified. The reliability bottleneck in this multi-agent system was the LLM as architect, not the LLM as coder. Claude's errors are documented in five groups ordered by severity (goal substitution, incomplete refactors, context loss, prompt design gaps, and process violations), alongside one documented Codex implementation error and one mixed-attribution historical error note in CODEXERRORS.md. All five quantum mechanics problems ultimately pass validation against the Griffiths and Schroeter values.
T. Bass
T. Bass (Mon,) studied this question.
www.synapsesocial.com/papers/69df2b65e4eeef8a2a6b05c0 — DOI: https://doi.org/10.5281/zenodo.19547884