October 10, 2025 | Open Access

Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models

Key Points

Reason-RFT achieves state-of-the-art performance on visual reasoning tasks, outperforming both open-source and proprietary models.
It generalizes better under domain shifts, maintaining robust performance across various tasks.
The framework is data-efficient: in few-shot settings it surpasses full-dataset supervised fine-tuning (SFT) baselines.
Training has two stages: SFT on curated Chain-of-Thought (CoT) data, followed by reinforcement learning with Group Relative Policy Optimization (GRPO) to improve adaptability to domain shifts.

Abstract

Visual reasoning abilities play a crucial role in understanding complex multimodal data, advancing both domain-specific applications and artificial general intelligence (AGI). Existing methods enhance Vision-Language Models (VLMs) through Chain-of-Thought (CoT) supervised fine-tuning using meticulously annotated data. However, this approach may lead to overfitting and cognitive rigidity, limiting the model's generalization ability under domain shifts and reducing real-world applicability. To overcome these limitations, we propose Reason-RFT, a two-stage reinforcement fine-tuning framework for visual reasoning. First, Supervised Fine-Tuning (SFT) with curated CoT data activates the reasoning potential of VLMs. This is followed by reinforcement learning based on Group Relative Policy Optimization (GRPO), which generates multiple reasoning-response pairs to enhance adaptability to domain shifts. To evaluate Reason-RFT, we reconstructed a comprehensive dataset covering visual counting, structural perception, and spatial transformation, serving as a benchmark for systematic assessment across three key dimensions. Experimental results highlight three advantages: (1) performance enhancement, with Reason-RFT achieving state-of-the-art results and outperforming both open-source and proprietary models; (2) generalization superiority, maintaining robust performance under domain shifts across various tasks; and (3) data efficiency, excelling in few-shot learning scenarios and surpassing full-dataset SFT baselines. Reason-RFT introduces a novel training paradigm for visual reasoning and marks a significant step forward in multimodal research. Project website: https://tanhuajie.github.io/ReasonRFT
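
The abstract says the GRPO stage samples multiple reasoning-response pairs per input. As a rough illustration of the group-relative idea, the sketch below scores each sampled response against the other responses in its own group, which is how GRPO is commonly formulated. The function name, the binary reward values, and the use of the population standard deviation are assumptions for illustration only; the abstract does not specify the reward design or normalization that Reason-RFT uses.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group.

    `rewards` holds one scalar per response sampled for the same prompt.
    A positive advantage means the response scored above the group average.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for one prompt
# (1.0 = correct final answer, 0.0 = incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Because the baseline is the group mean rather than a learned value function, a group in which every response earns the same reward yields zero advantage for all of them and so contributes no learning signal.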

Cite this study

Tan et al. (2025). Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models.

www.synapsesocial.com/papers/68e861a57ef2f04ca37e4582 — DOI: https://doi.org/10.48550/arxiv.2503.20752

Authors

Huajie Tan

Yuheng Ji

Xiaoshuai Hao
