What type of study is this?

This is an experimental study.

October 10, 2025 · Open Access

Praxis-VLM: Vision-Grounded Decision Making via Text-Driven Reinforcement Learning

Key Points

Praxis-VLM learns decision-making from textual descriptions of scenes, and the resulting reasoning skills transfer to inference with visual inputs.
Across diverse decision-making benchmarks, Praxis-VLM substantially outperforms standard supervised fine-tuning.
Trained with the GRPO algorithm on textual scenarios, Praxis-VLM learns to evaluate actions and their consequences.
These findings suggest that strong reasoning abilities can be developed without extensive paired image-text data.

Abstract

Vision Language Models exhibit impressive performance for various tasks, yet they often lack the sophisticated situational reasoning required for complex decision-making. This paper shows that VLMs can achieve surprisingly strong decision-making performance when visual scenes are replaced by textual descriptions, suggesting foundational reasoning can be effectively learned from language. Motivated by this insight, we propose Praxis-VLM, a reasoning VLM for vision-grounded decision-making. Praxis-VLM employs the GRPO algorithm on textual scenarios to instill robust reasoning capabilities, where models learn to evaluate actions and their consequences. These reasoning skills, acquired purely from text, successfully transfer to multimodal inference with visual inputs, significantly reducing reliance on scarce paired image-text training data. Experiments across diverse decision-making benchmarks demonstrate that Praxis-VLM substantially outperforms standard supervised fine-tuning, exhibiting superior performance and generalizability. Further analysis confirms that our models engage in explicit and effective reasoning, underpinning their enhanced performance and adaptability.
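The abstract names GRPO but does not spell out its mechanics. As a rough illustration only, the core idea of GRPO is to sample a group of responses for the same prompt, score each one, and normalize every reward against the group's own mean and standard deviation, so that no separate value model is needed. The sketch below shows that normalization step for a text-only multiple-choice scenario. The exact-match reward `answer_reward`, the sample answers, and the use of the population standard deviation are assumptions made for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    This is the group-relative advantage used by GRPO. Population std is
    an assumption here; implementations differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


def answer_reward(predicted, gold):
    """Hypothetical rule-based reward: 1.0 for the correct option, else 0.0."""
    return 1.0 if predicted == gold else 0.0


# Four hypothetical sampled answers to one textual decision-making scenario
# whose correct option is "B".
sampled_answers = ["B", "A", "B", "C"]
rewards = [answer_reward(a, "B") for a in sampled_answers]
advantages = group_relative_advantages(rewards)

for answer, adv in zip(sampled_answers, advantages):
    print(f"{answer}: advantage {adv:+.3f}")
```

Responses that score above the group average receive a positive advantage and are reinforced, while those below it are discouraged. When every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.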

Authors

Zhe Hu

Jing Li

Zuhui Pu
