Reinforcement Learning (RL) algorithms depend on well-defined reward functions for policy optimization. Designing such functions is a complex task, even for domain experts. However, valid task demonstrations can still be collected from moderately skilled individuals. This motivates the use of methods such as Imitation Learning (IL) and Inverse Reinforcement Learning (IRL), where an expert provides demonstrations, allowing the algorithm to infer a policy or reward function that aligns with the observed behavior. A common assumption in IRL is that demonstrations come from highly skilled experts. While some studies have explored the impact of suboptimal demonstrators, the influence of intentionally malicious demonstrations remains underexplored. This study introduces an adversarial demonstrator framework that systematically perturbs a subset of demonstrations to manipulate the reward function inferred by an IRL algorithm, and it quantifies the impact of such adversarial manipulations on the learned policy. Our results show that altering just 10% of the demonstrations can lead the IRL algorithm to learn a faulty reward function, degrading the performance of the trained policy by up to three times as much as adding 10% random trajectories does.
Arezoo Alipanah
Yash Vardhan Pant
University of Waterloo
Alipanah et al. studied this question.
www.synapsesocial.com/papers/69a75f8fc6e9836116a2b037 — DOI: https://doi.org/10.21428/594757db.25036134