Authors: Liang Chen · Lei Li · Haozhe Zhao · Yifan Song · Vinci · Zihao Yue · Lingpeng Kong · Qi Liu · Baobao Chang

https://github.com/Deep-Agent/R1-V

Introduction

With the R1-V framework, we present a first exploration of applying Reinforcement Learning with Verifiable Reward (RLVR) and reasoning-trace distillation from DeepSeek R1 to enhance Large Vision-Language Models (LVLMs) on visual reasoning tasks. Comparing against SFT on tasks such as visual counting, geometric reasoning, and complex visual reasoning, we find that RLVR shows stronger out-of-distribution (OOD) generalization, while SFT sometimes achieves better in-domain performance. This motivates our further research into explaining the discrepancy between RLVR and SFT in the vision-language domain and into extending the method to a broader range of tasks.

🔔 Note: key findings and underexplored questions are highlighted in color throughout the post for fast takeaways.

TL;DR

Methods

Visual RL with Verifiable Reward

We first investigate the self-evolution of VLMs through RL with Verifiable Reward (RLVR). We focus on precision-critical tasks such as counting and mathematical problems, where rule-based outcome rewards can be easily collected for model optimization. Given a training sample $(x_i, y_i)$, we task a policy model $\pi_{\theta}$ (i.e., the VLM) to predict a solution $\hat{y}_i$ for the input $x_i$, and gather outcome rewards by verifying its correctness against the ground-truth answer $y_i$. To enable the exploration of possible solution paths, we prompt $\pi_{\theta}$ to generate a thinking process $z_i$ before outputting the answer, so that the overall response takes the following format:

<think>thinking path</think>\n\n<answer>prediction</answer>
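For concreteness, a prompt along the following lines can elicit this format; the wording below is an illustrative sketch, not necessarily the exact instruction used in R1-V:

```python
# Hypothetical prompt template for eliciting the <think>/<answer> format.
SYSTEM_PROMPT = (
    "You first think about the reasoning process and then provide the answer. "
    "Wrap the reasoning in <think></think> tags and the final answer in "
    "<answer></answer> tags, i.e. <think>...</think>\n\n<answer>...</answer>."
)

def build_messages(question: str) -> list[dict]:
    """Assemble a chat-style prompt for the policy VLM (image input omitted here)."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]
```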

Following DeepSeek-R1-Zero, we employ an accuracy reward and an additional format reward for model optimization. The accuracy reward is given by validating the numerical or symbolic correctness of the final solution. The format reward evaluates whether the model output strictly adheres to the given format, i.e., wrapping the thinking process and final answer in corresponding tags.
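Below is a minimal sketch of how such rule-based rewards can be computed; the function names and the simple numeric/string matching are illustrative assumptions rather than the exact verifier in the released code:

```python
import re

# Response must be exactly one <think> block followed by one <answer> block.
THINK_ANSWER_RE = re.compile(r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL)

def format_reward(response: str) -> float:
    """1.0 if the response wraps thinking and answer in the required tags."""
    return 1.0 if THINK_ANSWER_RE.match(response.strip()) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    """1.0 if the extracted answer matches the ground truth (numeric or exact string)."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match is None:
        return 0.0
    prediction = match.group(1).strip()
    try:  # numeric comparison for counting / math answers
        return 1.0 if abs(float(prediction) - float(ground_truth)) < 1e-6 else 0.0
    except ValueError:  # fall back to exact string match for symbolic answers
        return 1.0 if prediction == ground_truth.strip() else 0.0
```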

We use Group Relative Policy Optimization (GRPO) as our RL algorithm, and directly apply RL to the base model without an SFT cold start. Note that none of the RLVR experiments use CoT annotations during training.
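To make the group-relative part of GRPO concrete, the sketch below normalizes each sampled response's reward by the mean and standard deviation of the rewards within its group; the clipped policy-ratio objective and KL penalty are omitted, so this is a simplified illustration rather than the full algorithm:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages for GRPO.

    rewards: shape (num_prompts, group_size), one row per prompt containing the
    verifiable rewards of the G responses sampled for that prompt.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled responses each, with 0/1 outcome rewards.
rewards = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(grpo_advantages(rewards))
```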

SFT with Visual Reasoning Traces Distilled from R1

We investigate the transfer of reasoning capabilities from text-only R1 to Vision-Language Models (VLMs).

While reasoning LLMs like R1 excel at explicit step-by-step reasoning, VLMs often lack such transparent reasoning abilities.