March 3, 2026 · Open Access

Self-Correcting Reward Shaping with Language Models for Reinforcement Learning Agents in Games

Key points

LLMs can effectively generate reward configurations for agents, enhancing their behavior in games.
In a racing task, the success rate rose from approximately 9% to 74% after a single feedback iteration and eventually exceeded 80%, against a human-expert baseline of 94%.
The approach includes a zero-shot pipeline and an iterative self-correcting loop based on performance metrics.
These findings suggest that game developers could rely less on RL experts for reward design.

Abstract

Reinforcement Learning (RL) in games has gained significant momentum in recent years, enabling the creation of diverse agent behaviors that can enrich a player’s experience. However, deploying RL agents in production settings still presents two key challenges: (1) crafting effective reward functions typically requires RL expertise, and (2) once a game’s content or mechanics are changed, previously tuned rewards often need to be reworked from scratch. This thesis proposes a method for automating reward tuning through the use of Large Language Models (LLMs), starting from a high-level user prompt describing the desired agent behavior. We present two approaches: a zero-shot pipeline, where the LLM proposes an initial reward configuration without any feedback, and a self-correcting loop that iteratively refines these weights based on relevant performance metrics. We test both methods across four tasks in two different environments: navigation and racing. Our results show that LLMs can generate reward configurations that lead to meaningful agent behavior even without iteration, and that the feedback-driven loop consistently improves performance over time. In the racing task, for instance, success rates increased from ∼9% to 74% in a single feedback iteration, eventually reaching over 80% – a competitive performance against the human-expert baseline of 94%, and with identical efficiency in terms of average completion time. With these results, this work takes a first step toward automating reward design in RL, enabling game designers to integrate RL agents with reduced reliance on specialists.
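
The two approaches described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the thesis's implementation: the names `self_correcting_loop`, `propose`, and `train_and_evaluate` are hypothetical, the LLM call and the RL training run are replaced by toy stand-ins, and the reward terms and numbers are invented for the example rather than taken from the reported results.

```python
from typing import Callable, Dict, List, Tuple

Weights = Dict[str, float]   # reward term name -> weight
Metrics = Dict[str, float]   # performance metric name -> value
History = List[Tuple[Weights, Metrics]]


def self_correcting_loop(
    prompt: str,
    propose: Callable[[str, History], Weights],
    train_and_evaluate: Callable[[Weights], Metrics],
    iterations: int = 3,
) -> History:
    """Run one zero-shot proposal followed by `iterations` feedback rounds.

    `propose` stands in for the LLM and `train_and_evaluate` for an RL
    training run plus evaluation; both are hypothetical placeholders.
    """
    history: History = []
    for _ in range(iterations + 1):
        # With an empty history this is the zero-shot proposal;
        # afterwards the proposer sees earlier weights and metrics.
        weights = propose(prompt, history)
        metrics = train_and_evaluate(weights)
        history.append((weights, metrics))
    return history


# Toy stand-ins so the sketch runs end to end (values are invented).
def toy_propose(prompt: str, history: History) -> Weights:
    if not history:
        return {"progress": 1.0, "collision_penalty": 0.0}
    last_weights, _ = history[-1]
    raised = last_weights["collision_penalty"] + 0.5
    return {**last_weights, "collision_penalty": raised}


def toy_train_and_evaluate(weights: Weights) -> Metrics:
    return {"success_rate": min(1.0, 0.1 + 0.4 * weights["collision_penalty"])}


history = self_correcting_loop(
    "Finish the lap without crashing.",
    toy_propose,
    toy_train_and_evaluate,
    iterations=2,
)
for step, (weights, metrics) in enumerate(history):
    print(step, weights, metrics)
```

The first entry in `history` corresponds to the zero-shot pipeline; each later entry is one round of the feedback loop. Keeping the proposer and the evaluator as plain callables is a design choice for this sketch, so that either can be swapped for a real LLM client or training job without changing the loop.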
