What type of study is this?

This is a quantitative study.

September 23, 2025 · Open Access

Semi-off-Policy Reinforcement Learning for Vision-Language Slow-thinking Reasoning

Key Points

SOPHIA is a semi-off-policy RL method that gives large vision-language models (LVLMs) slow-thinking reasoning ability for complex multimodal tasks.
The method improves InternVL3.0-38B by 8.50% on average across multiple multimodal reasoning benchmarks.
It combines on-policy visual understanding from the trainable LVLM with off-policy slow-thinking reasoning from a language model, with the aim of mitigating the visual hallucinations that can arise when trajectories are distilled directly from external models.
In the authors' analysis, SOPHIA outperforms supervised fine-tuning and direct on-policy RL methods, and offers a better policy initialization for further on-policy training.

Abstract

Enhancing large vision-language models (LVLMs) with visual slow-thinking reasoning is crucial for solving complex multimodal tasks. However, since LVLMs are mainly trained with vision-language alignment, it is difficult to adopt on-policy reinforcement learning (RL) to develop the slow-thinking ability, because the rollout space is restricted by the model's initial abilities. Off-policy RL offers a way to go beyond the current policy, but directly distilling trajectories from external models may cause visual hallucinations due to mismatched visual perception abilities across models. To address these issues, this paper proposes SOPHIA, a simple and scalable Semi-Off-Policy RL for vision-language slow-tHInking reAsoning. SOPHIA builds a semi-off-policy behavior model by combining on-policy visual understanding from a trainable LVLM with off-policy slow-thinking reasoning from a language model, assigns outcome-based rewards to reasoning, and propagates visual rewards backward. The LVLM then learns slow-thinking reasoning ability from the obtained reasoning trajectories, using the propagated rewards, via off-policy RL algorithms. Extensive experiments with InternVL2.5 and InternVL3.0 at the 8B and 38B scales show the effectiveness of SOPHIA. Notably, SOPHIA improves InternVL3.0-38B by 8.50% on average, reaching state-of-the-art performance among open-source LVLMs on multiple multimodal reasoning benchmarks, and even outperforms some closed-source models (e.g., GPT-4.1) on the challenging MathVision and OlympiadBench, achieving 49.08% and 49.95% pass@1 accuracy, respectively. Analysis shows that SOPHIA outperforms supervised fine-tuning and direct on-policy RL methods, offering a better policy initialization for further on-policy training.
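The data flow described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`describe`, `reason`, `extract_answer`, `build_trajectories`), the "Answer:" convention, and in particular the rule that a description's visual reward is the mean outcome reward of the reasoning built on it are all assumptions made for illustration. The abstract only states that outcome-based rewards are assigned to reasoning and that visual rewards are propagated backward.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List


@dataclass
class Trajectory:
    description: str        # on-policy visual understanding (from the trainable LVLM)
    reasoning: str          # off-policy slow-thinking reasoning (from a language model)
    outcome_reward: float   # 1.0 if the final answer matches the reference, else 0.0
    visual_reward: float = 0.0  # filled in by backward propagation (assumed rule)


def extract_answer(reasoning: str) -> str:
    """Hypothetical convention: the final answer follows the last 'Answer:'."""
    return reasoning.rsplit("Answer:", 1)[-1].strip()


def build_trajectories(
    describe: Callable[[str], str],      # stand-in for the trainable LVLM
    reason: Callable[[str, str], str],   # stand-in for the language model
    image: str,
    question: str,
    reference: str,
    n_desc: int = 2,
    n_reason: int = 2,
) -> List[Trajectory]:
    trajectories: List[Trajectory] = []
    for _ in range(n_desc):
        description = describe(image)
        group = []
        for _ in range(n_reason):
            reasoning = reason(description, question)
            reward = 1.0 if extract_answer(reasoning) == reference else 0.0
            group.append(Trajectory(description, reasoning, reward))
        # Assumed propagation rule: a description is scored by how often
        # reasoning that relies on it reaches the correct answer.
        visual = mean(t.outcome_reward for t in group)
        for t in group:
            t.visual_reward = visual
        trajectories.extend(group)
    return trajectories


def demo() -> List[Trajectory]:
    """Toy run with deterministic stubs in place of real models."""
    descriptions = iter(["a triangle with sides 3, 4, 5", "a blurry shape"])

    def describe(image: str) -> str:
        return next(descriptions)

    def reason(description: str, question: str) -> str:
        if "3, 4, 5" in description:
            return "The sides sum to 3 + 4 + 5. Answer: 12"
        return "The description is too vague. Answer: unknown"

    return build_trajectories(
        describe, reason, "triangle.png", "What is the perimeter?", "12"
    )
```

In this toy run, the accurate description receives a high visual reward because the reasoning built on it succeeds, while the vague description receives a low one. The resulting trajectories and rewards would then be passed to an off-policy RL algorithm, which is outside the scope of this sketch.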

Authors

Junhao Shen

Haiteng Zhao

Yuantong Gu
