Model-based reinforcement learning agents that plan entirely in imagination can achieve high imagined returns while completely failing the actual task, a failure mode we term the exploitation gap. We provide the first systematic characterisation of this gap in DreamerV3 on AntMaze, where the world model receives near-zero reward from real experience. Instrumenting the training loop with four new metrics, we show that the imagined-to-real reward ratio reaches approximately 50x at 500k environment steps while evaluation return stays below 0.05. We establish that KL divergence collapse is a leading indicator of exploitation onset, with a lag of approximately 50k steps (r = -0.91, p < 0.001), providing an actionable early-warning signal. Comparing to the hierarchical baseline THICK, we show that sparse context-kernel gating reduces but does not eliminate the gap. A dense-reward ablation confirms that a rich reward signal suppresses exploitation entirely. We propose three KL-aware mitigation strategies and release all experimental infrastructure for reproducibility.
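The two quantities named in the abstract, the imagined-to-real reward ratio and the lagged correlation between KL divergence and exploitation onset, can be illustrated with a minimal sketch. The abstract does not give the exact metric definitions, so everything below is an assumption made for illustration: the function names `imagined_to_real_ratio` and `lagged_pearson`, the `eps` floor on the denominator, and the choice of a plain Pearson correlation over series sampled at a fixed logging interval (so that `lag` is counted in logging intervals, not environment steps).

```python
import numpy as np


def imagined_to_real_ratio(imagined_reward, real_reward, eps=1e-8):
    """Element-wise ratio of imagined to real reward.

    Assumed definition: the denominator is floored at `eps` so that
    near-zero real reward (the sparse-reward case) does not divide by zero.
    """
    imagined = np.asarray(imagined_reward, dtype=float)
    real = np.asarray(real_reward, dtype=float)
    return imagined / np.maximum(np.abs(real), eps)


def lagged_pearson(leading, trailing, lag):
    """Pearson r between leading[t] and trailing[t + lag].

    Both series must have the same length and be sampled at the same
    logging interval; `lag` is measured in logging intervals.
    """
    leading = np.asarray(leading, dtype=float)
    trailing = np.asarray(trailing, dtype=float)
    if leading.shape != trailing.shape:
        raise ValueError("series must have the same length")
    if lag < 0 or lag >= len(leading) - 1:
        raise ValueError("lag must leave at least two overlapping samples")
    if lag > 0:
        # Align each leading sample with the trailing sample `lag` steps later.
        leading, trailing = leading[:-lag], trailing[lag:]
    return float(np.corrcoef(leading, trailing)[0, 1])
```

With logs taken every 25k environment steps, for example, a 50k-step lag would correspond to `lag=2`. A strongly negative value from `lagged_pearson(kl_series, ratio_series, lag)` would then mean that a drop in KL tends to precede a rise in the ratio, which is the kind of relationship the abstract reports.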
Arkat Khassanov (Thu,) studied this question.
www.synapsesocial.com/papers/69f44325967e944ac55667ca — DOI: https://doi.org/10.5281/zenodo.19894702
Arkat Khassanov
Astana Medical University