March 3, 2026 · Open Access

Generating Malicious Demonstration Policies to Exploit Vulnerabilities in Inverse Reinforcement Learning

Key Points

Altering just 10% of the demonstrations can lead the IRL algorithm to learn a faulty reward function and degrade the trained policy's performance.
Manipulated demonstrations can degrade the learned policy by up to three times as much as adding 10% random trajectories.
Adversarial manipulations can systematically perturb the reward functions inferred by IRL algorithms.
The observed effects of malicious demonstrations highlight a critical gap in existing IRL studies.

Abstract

Reinforcement Learning (RL) algorithms depend on well-defined reward functions for policy optimization. Designing such functions is a complex task, even for domain experts. However, valid task demonstrations can still be collected from moderately skilled individuals. This motivates methods such as Imitation Learning (IL) and Inverse Reinforcement Learning (IRL), in which an expert provides demonstrations and the algorithm infers a policy or reward function that aligns with the observed behavior. A common assumption in IRL is that demonstrations come from highly skilled experts. While some studies have explored the impact of suboptimal demonstrators, the influence of intentionally malicious demonstrations remains underexplored. This study introduces an adversarial demonstrator framework that systematically perturbs a subset of demonstrations to manipulate the reward function inferred by an IRL algorithm, and it quantifies the impact of such adversarial manipulations on the learned policy. Our results show that altering just 10% of the demonstrations can lead the IRL algorithm to learn a faulty reward function, ultimately degrading the performance of the trained policy by up to three times as much as adding 10% random trajectories does.
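To give a feel for why targeted perturbations can matter more than random ones, here is a minimal toy sketch. It is not the authors' framework. It assumes a hypothetical five-state chain task, and it uses the shift in empirical state-visitation frequencies (a statistic that many IRL algorithms try to match) as a rough stand-in for how far the inferred reward could be pulled. All names (`expert_demo`, `random_demo`, `adversarial_demo`, `contaminate`) and the hand-picked adversarial trajectory are assumptions made for illustration.

```python
import random

N_STATES = 5   # hypothetical chain world: states 0..4, goal is state 4
HORIZON = 10   # steps per demonstration

def expert_demo():
    # Walk right toward the goal, then stay there.
    return [min(t, N_STATES - 1) for t in range(HORIZON)]

def random_demo(rng):
    # Unbiased random walk starting at state 0, clipped to the chain.
    s, traj = 0, []
    for _ in range(HORIZON):
        traj.append(s)
        s = max(0, min(N_STATES - 1, s + rng.choice([-1, 1])))
    return traj

def adversarial_demo():
    # Hand-picked worst case for this toy (an assumption, not a learned
    # adversarial policy): never leave the start state.
    return [0] * HORIZON

def feature_expectations(demos):
    # Average state-visitation frequency over all demonstrations.
    counts = [0.0] * N_STATES
    for traj in demos:
        for s in traj:
            counts[s] += 1.0
    total = len(demos) * HORIZON
    return [c / total for c in counts]

def contaminate(demos, make_demo, fraction):
    # Replace the first `fraction` of the demonstrations with new ones.
    k = round(fraction * len(demos))
    return [make_demo() for _ in range(k)] + demos[k:]

def l1_shift(a, b):
    # L1 distance between two visitation-frequency vectors.
    return sum(abs(x - y) for x, y in zip(a, b))

rng = random.Random(0)
clean = [expert_demo() for _ in range(100)]
mu_clean = feature_expectations(clean)
mu_random = feature_expectations(contaminate(clean, lambda: random_demo(rng), 0.10))
mu_adv = feature_expectations(contaminate(clean, adversarial_demo, 0.10))

shift_random = l1_shift(mu_clean, mu_random)
shift_adv = l1_shift(mu_clean, mu_adv)
print(f"random shift: {shift_random:.3f}, adversarial shift: {shift_adv:.3f}")
```

In this toy, replacing 10% of the demonstrations with the hand-picked trajectory moves the visitation statistics at least as far as replacing them with random walks. It says nothing about the magnitude reported in the study, which depends on the authors' environments, IRL algorithm, and perturbation method.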

Authors

Arezoo Alipanah

Yash Vardhan Pant

Institutions

University of Waterloo
