Traditional reinforcement learning (RL) typically treats time as an externally fixed budget (for example, a maximum episode length or a discount factor), which truncates agent trajectories independently of the agent’s behavior. This paper explores an alternative perspective in which time is considered a resource that the agent can influence through its exploratory actions. We describe a simple augmentation to the state space that includes the remaining time budget, and we define the per-step time consumption as a decreasing function of the novelty of the visited state (measured by episode-based visit counts). This design allows the agent to reduce its time consumption in unfamiliar regions, thereby effectively extending the number of steps available to reach sparse rewards. We test this mechanism in the MiniGrid-Empty-8x8 environment using a controlled causal identification protocol with six orthogonal ablations across 15 random seeds. Our full model achieves a mean success rate of 10.4% (90% CI [3.7%, 17.8%]), whereas all baseline conditions (Reward Only, Random Time, Time Only, Permuted Time Placebo) yield success rates below 2.0%. We then examine a distribution shift scenario: agents trained only on the 8x8 maze are transferred directly to a 16x16 maze (shortest path ≈30 steps) while the initial time budget remains fixed at 12 steps. The placebo model, which shares the same intrinsic reward and empirical time-consumption distribution as our full model but lacks the state-dependent coupling, shows a success rate of 2.0% in the larger environment. In contrast, our full model achieves a success rate of 12.4% (p = 0.0305 vs. placebo) and, in some seeds, reaches the goal in 40 physical steps (3.33 times the initial budget). A Wasserstein distance analysis indicates a measurable shift in the per-step time consumption distribution between the 8x8 and 16x16 environments (unconditional W-dist = 0.0825, p < 0.01). These results suggest that the ability to adapt time consumption online, rather than relying on a static distribution learned during training, may be beneficial for generalization under distribution shift, and that fixed pre-learned schedules can be fragile in out-of-distribution settings.
guoyong chen (Tue,) studied this question.
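
The mechanism summarized in the abstract (a remaining time budget appended to the state, with a per-step cost that decreases as the novelty of the visited state increases) can be illustrated with a minimal Python sketch. The abstract does not give the functional form, so everything specific here is an assumption made for illustration: the class name `NoveltyTimeBudget`, the helper `steps_until_exhausted`, the inverse-square-root novelty measure, and the cost bounds `max_cost` and `min_cost`. Only the initial budget of 12 and the use of episode-based visit counts come from the text.

```python
from collections import defaultdict


class NoveltyTimeBudget:
    """Hypothetical tracker: per-step time cost shrinks as novelty grows."""

    def __init__(self, initial_budget=12.0, max_cost=1.0, min_cost=0.1):
        # max_cost and min_cost are assumed values, not taken from the paper.
        self.initial_budget = initial_budget
        self.max_cost = max_cost
        self.min_cost = min_cost
        self.reset()

    def reset(self):
        # Visit counts are episode-based, so they are cleared on every reset.
        self.remaining = self.initial_budget
        self.visits = defaultdict(int)

    def step_cost(self, state):
        # Assumed novelty: 1 / sqrt(n), where n is the episode visit count
        # (treated as at least 1 so an unseen state counts as fully novel).
        n = max(self.visits[state], 1)
        novelty = 1.0 / n ** 0.5
        # Cost falls linearly from max_cost (novelty near 0) to min_cost (novelty 1).
        return self.max_cost - (self.max_cost - self.min_cost) * novelty

    def step(self, state):
        """Record a visit, charge its cost, return (augmented_state, done)."""
        self.visits[state] += 1
        self.remaining -= self.step_cost(state)
        done = self.remaining <= 0.0
        # The augmented state pairs the raw state with the normalized budget.
        fraction_left = max(self.remaining, 0.0) / self.initial_budget
        return (state, fraction_left), done


def steps_until_exhausted(budget, states):
    """Count how many states are visited before the time budget runs out."""
    budget.reset()
    steps = 0
    for state in states:
        steps += 1
        _, done = budget.step(state)
        if done:
            break
    return steps


if __name__ == "__main__":
    # An agent that keeps reaching new states outlasts one that stays put.
    print(steps_until_exhausted(NoveltyTimeBudget(), range(1000)))
    print(steps_until_exhausted(NoveltyTimeBudget(), [0] * 1000))
```

Under these assumed constants, a trajectory through unvisited states is charged far less per step than one that revisits the same state, which is how a nominal budget of 12 can cover more than 12 physical steps. A different decreasing function of novelty would change the exact step counts but not this qualitative effect.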