What type of study is this?

This is a quantitative study.

September 20, 2025

Indirect Online Preference Optimization via Reinforcement Learning

Key Points

Indirect online preference optimization cuts model iteration time from months to one week.
Using adversarial training, the study addresses the distribution bias between human and model annotations.
Extensive experiments show that the proposed IOPO mechanism outperforms existing state-of-the-art methods in both offline and online alignment scenarios.
The approach maintains linear computational complexity while enhancing human preference alignment for large language models.

Abstract

Human preference alignment (HPA) aims to ensure that Large Language Models (LLMs) respond appropriately and meet human moral and ethical requirements. Existing methods, such as RLHF and DPO, rely heavily on high-quality human annotation, which restricts the efficiency of iterative online model refinement. To address the inefficiency of acquiring human annotations, the iterated online strategy advocates using fine-tuned LLMs to self-generate preference data. However, this approach is prone to distribution bias, caused by differences between human and model annotations as well as modeling errors between simulators and real-world contexts. To mitigate the impact of distribution bias, we adopt the principles of adversarial training, framing a zero-sum two-player game between a protagonist agent and an adversarial agent. As the adversarial agent challenges the alignment of the protagonist agent, we continuously refine the protagonist’s performance. Using min-max equilibrium and Nash equilibrium strategies, we propose the Indirect Online Preference Optimization (IOPO) mechanism, which enables the protagonist agent to converge without bias while maintaining linear computational complexity. Extensive experiments across three real-world datasets demonstrate that IOPO outperforms state-of-the-art alignment methods in both offline and online scenarios, as evidenced by standard alignment metrics and human evaluations. This approach reduces the time required for model iterations from months to one week, alleviates distribution shifts, and significantly cuts annotation costs.
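
The abstract frames alignment as a zero-sum two-player game solved at a min-max equilibrium, but this page does not give the IOPO update rules. The sketch below is therefore not the authors' method. It is a generic toy illustration, under stated assumptions, of what a min-max equilibrium means: a protagonist (row player) and an adversary (column player) repeatedly adjust mixed strategies on a small payoff matrix using multiplicative-weights updates, and their time-averaged strategies approach the equilibrium. The payoff matrix, the function names (`solve_zero_sum`, `exploitability`), and the hyperparameters are all hypothetical choices made for illustration.

```python
import math


def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def solve_zero_sum(payoff, rounds=20000, lr=0.01):
    """Approximate a min-max equilibrium of a zero-sum matrix game.

    payoff[i][j] is the protagonist's reward when it plays row i and the
    adversary plays column j (the adversary receives the negative).
    Both players use multiplicative-weights updates; the time-averaged
    strategies are returned. This is a toy stand-in, not IOPO itself.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    row_p = [1.0 / n_rows] * n_rows
    col_p = [1.0 / n_cols] * n_cols
    row_sum = [0.0] * n_rows
    col_sum = [0.0] * n_cols
    for _ in range(rounds):
        # Accumulate the strategies actually played this round.
        row_sum = [s + p for s, p in zip(row_sum, row_p)]
        col_sum = [s + p for s, p in zip(col_sum, col_p)]
        # Expected payoff of each pure action against the opponent's mix.
        row_gain = [sum(payoff[i][j] * col_p[j] for j in range(n_cols))
                    for i in range(n_rows)]
        col_loss = [sum(payoff[i][j] * row_p[i] for i in range(n_rows))
                    for j in range(n_cols)]
        # Protagonist shifts weight toward high payoff, adversary toward low.
        row_p = normalize([p * math.exp(lr * g)
                           for p, g in zip(row_p, row_gain)])
        col_p = normalize([p * math.exp(-lr * l)
                           for p, l in zip(col_p, col_loss)])
    return normalize(row_sum), normalize(col_sum)


def exploitability(payoff, row_p, col_p):
    """Gap between each side's best response value; 0 at an exact equilibrium."""
    n_rows, n_cols = len(payoff), len(payoff[0])
    best_row = max(sum(payoff[i][j] * col_p[j] for j in range(n_cols))
                   for i in range(n_rows))
    best_col = min(sum(payoff[i][j] * row_p[i] for i in range(n_rows))
                   for j in range(n_cols))
    return best_row - best_col


# Hypothetical 2x2 game; its exact equilibrium has both players
# choosing their first action with probability 0.4.
PAYOFF = [[2.0, -1.0], [-1.0, 1.0]]
row_mix, col_mix = solve_zero_sum(PAYOFF)
print(row_mix, col_mix, exploitability(PAYOFF, row_mix, col_mix))
```

The exploitability value measures how much either player could still gain by deviating, so a value near zero indicates that neither the protagonist nor the adversary can improve unilaterally, which is the sense in which a min-max solution is also a Nash equilibrium in a zero-sum game.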

Authors

En Wang

Xingyu Lin

Du Su

Institutions

Jilin University

Institute of Computing Technology

Baidu (China)
