What type of study is this?

This is a quantitative study.

September 29, 2025 · Open Access

Generative RLHF-V: Learning Principles from Multi-modal Human Preference

Key Points

Generative RLHF-V improves the performance of 4 MLLMs across 7 benchmarks by 18.1%, compared with 5.3% for baseline RLHF.
The experimental results also show out-of-distribution generalization of reward model (RM) discrimination.
The framework combines generative reward models (GRMs) with multi-modal RLHF, using RL to guide GRMs to actively capture human intention.
Performance improves near-linearly as the number of candidate responses increases.

Abstract

Training multi-modal large language models (MLLMs) that align with human intentions is a long-term challenge. Traditional score-only reward models for alignment suffer from low accuracy, weak generalization, and poor interpretability, blocking the progress of alignment methods, e.g., reinforcement learning from human feedback (RLHF). Generative reward models (GRMs) leverage MLLMs' intrinsic reasoning capabilities to discriminate pair-wise responses, but their pair-wise paradigm makes it hard to generalize to learnable rewards. We introduce Generative RLHF-V, a novel alignment framework that integrates GRMs with multi-modal RLHF. We propose a two-stage pipeline: multi-modal generative reward modeling from RL, where RL guides GRMs to actively capture human intention and then predict the correct pair-wise scores; and RL optimization from grouped comparison, which enhances multi-modal RL scoring precision through grouped response comparison. Experimental results demonstrate that, besides out-of-distribution generalization of RM discrimination, our framework improves 4 MLLMs' performance across 7 benchmarks by 18.1%, while baseline RLHF improves it by only 5.3%. We further validate that Generative RLHF-V achieves a near-linear improvement with an increasing number of candidate responses. Our code and models can be found at https://generative-rlhf-v.github.io.
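The second stage described in the abstract scores each candidate response by comparing it with the other responses in its group. The sketch below shows one plausible way to turn pair-wise scores into a per-response reward. The averaging rule, the `score_pair` callable, and `toy_length_scorer` are illustrative assumptions, not the authors' implementation; in practice `score_pair` would wrap a generative reward model.

```python
from itertools import permutations
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: given a prompt and two responses, return one
# score for each response. A real system would query a GRM here.
PairScorer = Callable[[str, str, str], Tuple[float, float]]


def grouped_comparison_scores(
    prompt: str,
    responses: Sequence[str],
    score_pair: PairScorer,
) -> List[float]:
    """Assumed aggregation: average pair-wise score margin within a group."""
    n = len(responses)
    if n < 2:
        raise ValueError("grouped comparison needs at least two responses")

    totals = [0.0] * n
    # Ordered pairs (i, j) with i != j, so every response appears in the
    # first position once against each other member of the group.
    for i, j in permutations(range(n), 2):
        score_i, score_j = score_pair(prompt, responses[i], responses[j])
        totals[i] += score_i - score_j

    # Each response took part in n - 1 comparisons as the first element.
    return [total / (n - 1) for total in totals]


def toy_length_scorer(prompt: str, a: str, b: str) -> Tuple[float, float]:
    """Stand-in scorer for demonstration only: longer responses score higher."""
    return float(len(a)), float(len(b))


if __name__ == "__main__":
    rewards = grouped_comparison_scores(
        "Describe the image.", ["a", "bbb", "cc"], toy_length_scorer
    )
    print(rewards)
```

Under this assumed rule, adding more candidates to a group gives each response more comparisons to average over, which is one intuition for why the number of candidate responses matters.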


Cite this study

Zhou et al. studied this question.

www.synapsesocial.com/papers/68da58d8c1728099cfd110f8 — DOI: https://doi.org/10.48550/arxiv.2505.18531


Authors

Jiayi Zhou

Jiaming Ji

Boyuan Chen
