February 25, 2024 · Open Access

Don't Forget Your Reward Values: Language Model Alignment via Value-based Calibration

Abstract

While Reinforcement Learning from Human Feedback (RLHF) significantly enhances the generation quality of Large Language Models (LLMs), recent studies have raised concerns regarding the complexity and instability associated with the Proximal Policy Optimization (PPO) algorithm, proposing a series of order-based calibration methods as viable alternatives. This paper delves further into current order-based methods, examining their inefficiencies in utilizing reward values and addressing misalignment issues. Building upon these findings, we propose a novel Value-based CaliBration (VCB) method to better align LLMs with human preferences. Experimental results demonstrate that VCB surpasses existing alignment methods on AI assistant and summarization datasets, providing impressive generalizability, robustness, and stability in diverse settings.
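To illustrate what the abstract means by order-based methods making inefficient use of reward values, here is a minimal sketch. It is not the paper's VCB method, whose loss is not given on this page. It assumes a generic pairwise logistic ranking loss, and the function names and toy numbers are hypothetical. The point is only that such a loss uses rewards to decide which response is preferred, so the size of the reward gap never enters the objective.

```python
import math


def pairwise_order_loss(logp_preferred, logp_rejected):
    """Logistic ranking loss on the log-probability margin (assumed form)."""
    margin = logp_preferred - logp_rejected
    return math.log(1.0 + math.exp(-margin))


def order_based_loss(logps, rewards):
    """Sum the pairwise loss over every pair where rewards[i] > rewards[j].

    Rewards are used only to decide the ordering of each pair;
    their magnitudes never appear in the loss value.
    """
    total = 0.0
    for i in range(len(logps)):
        for j in range(len(logps)):
            if rewards[i] > rewards[j]:
                total += pairwise_order_loss(logps[i], logps[j])
    return total


# Hypothetical policy log-probabilities for three candidate responses.
logps = [-1.0, -2.0, -3.0]

# Two reward vectors with the same ranking but very different gaps.
small_gap = [0.52, 0.51, 0.50]
large_gap = [9.0, 1.0, -5.0]

loss_small = order_based_loss(logps, small_gap)
loss_large = order_based_loss(logps, large_gap)

# Same ordering, so the loss is identical despite the different reward values.
print(loss_small == loss_large)
```

Under these assumptions, a response that is barely better and one that is far better than its alternatives produce the same training signal, which is the kind of information loss a value-based approach aims to avoid.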

Cite this study

Mao et al. (2024). Don't Forget Your Reward Values: Language Model Alignment via Value-based Calibration.

www.synapsesocial.com/papers/68e77b35b6db6435876ef951
DOI: https://doi.org/10.48550/arxiv.2402.16030

Authors

Xin Mao

Feng-Lin Li

Huimin Xu
