March 8, 2024Open Access

Overcoming Reward Overoptimization via Adversarial Policy Optimization with Lightweight Uncertainty Estimation

Key Points

Key points are not available for this paper at this time.

Abstract

We introduce Adversarial Policy Optimization (AdvPO), a novel solution to the pervasive issue of reward over-optimization in Reinforcement Learning from Human Feedback (RLHF) for Large Language Models (LLMs). Over-optimization occurs when a reward model serves as an imperfect proxy for human preference, and RL-driven policy optimization erroneously exploits reward inaccuracies. In this paper, we begin by introducing a lightweight way to quantify uncertainties in rewards, relying solely on the last layer embeddings of the reward model, without the need for computationally expensive reward ensembles. AdvPO then addresses a distributionally robust optimization problem centred around the confidence interval of the reward model's predictions for policy improvement. Through comprehensive experiments on the Anthropic HH and TL;DR summarization datasets, we illustrate the efficacy of AdvPO in mitigating the overoptimization issue, consequently resulting in enhanced performance as evaluated through human-assisted evaluation.

Connected Papers

Building similarity graph...

Analyzing shared references across papers

Discussion

Cite this study

Zhang et al. (Fri,) studied this question.

www.synapsesocial.com/papers/68e752dab6db6435876cb7fc — DOI: https://doi.org/10.48550/arxiv.2403.05171

Also consider

Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context:

Authors

Xiaoying Zhang

Jean-François Ton

Wei Shen

Actions

References and Citations

Connected Papers

Building similarity graph...

Analyzing shared references across papers

Overcoming Reward Overoptimization via Adversarial Policy Optimization with Lightweight Uncertainty Estimation

Key Points

Abstract

Citation Network

Connected Papers

Discussion

Cite this study

Also consider

Authors

Actions

References and Citations

Citation Network

Connected Papers

Discussion