Enhancing reasoning in Large Multimodal Models (LMMs) faces unique challenges from the complex interplay between visual perception and logical reasoning, particularly in compact 3B-parameter architectures where architectural constraints limit reasoning capacity and modality alignment. While rule-based reinforcement learning (RL) excels in text-only domains, its multimodal extension confronts two critical barriers: (1) data limitations due to ambiguous answers and scarce complex reasoning examples, and (2) degraded foundational reasoning induced by multimodal pretraining. To address these challenges, we propose LMM-R1, a two-stage framework adapting rule-based RL for multimodal reasoning through Foundational Reasoning Enhancement (FRE) followed by Multimodal Generalization Training (MGT). The FRE stage first strengthens reasoning abilities using text-only data with rule-based RL, then the MGT stage generalizes these reasoning capabilities to multimodal domains. Experiments on Qwen2.5-VL-Instruct-3B demonstrate that LMM-R1 achieves 4.83% and 4.5% average improvements over baselines in multimodal and text-only benchmarks, respectively, with a 3.63% gain in complex Football Game tasks. These results validate that text-based reasoning enhancement enables effective multimodal generalization, offering a data-efficient paradigm that bypasses costly high-quality multimodal training data.
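To make the idea of a rule-based reward and the two-stage schedule concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the authors' implementation: the `<answer>` tag convention, the reward values, the function name `rule_based_reward`, and the `STAGES` list are all hypothetical, since the abstract does not specify them.

```python
import re

# Assumption: answers are wrapped in <answer>...</answer> tags.
# The abstract does not say how answers are marked.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def rule_based_reward(response: str, reference: str) -> float:
    """Score a response with fixed rules instead of a learned reward model.

    Hypothetical values: 1.0 for a well-formed correct answer,
    0.1 for a well-formed wrong answer, 0.0 if the format is missing.
    """
    match = ANSWER_RE.search(response)
    if match is None:
        return 0.0
    predicted = match.group(1).strip()
    return 1.0 if predicted == reference.strip() else 0.1


# The two stages named in the abstract, in training order. The data labels
# are placeholders for whatever datasets each stage actually uses.
STAGES = [
    ("FRE", "text_only"),   # Foundational Reasoning Enhancement
    ("MGT", "multimodal"),  # Multimodal Generalization Training
]
```

The same reward function can in principle serve both stages, because it only inspects the text of the response; what changes between FRE and MGT is the kind of input the model is trained on.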
Peng et al. (Mon,) studied this question.
www.synapsesocial.com/papers/68d90a0f41e1c178a14f6936 — DOI: https://doi.org/10.48550/arxiv.2503.07536
Yingzhe Peng
Gongrui Zhang
Miaosen Zhang