Supervision-free RLVR for VLM post-training

PuzzleCraft: Exploration-Aware Curriculum Learning for Puzzle-Based RLVR in VLMs

PuzzleCraft studies how lightweight visual puzzles, exploration-aware curricula, and reasoning–answer consistency can make puzzle-based RLVR more effective for vision–language models.

Ahmadreza Jeddi1,2,3,*, Hakki C. Karaimer1,*, Hue Nguyen1, Zhongling Wang1,**, Ke Zhao1,**, Javad Rajabi1,2,3,**, Ran Zhang1,**, Raghav Goyal1, Konstantinos G. Derpanis1,3,4, Babak Taati2,3, Radek Grzeszczuk1
1AI Center–Toronto, Samsung Electronics 2University of Toronto 3Vector Institute 4York University
* Equal contribution    ** Equal contribution
PuzzleCraft overview figure from the paper

Figure 1. PuzzleCraft uses three automatically verifiable puzzle environments (Jigsaw, PatchFit, and Rotation) and combines curriculum weighting with reasoning–answer consistency analysis.

What changes in PuzzleCraft?

PuzzleCraft frames puzzle-based RLVR as a problem of curriculum design and consistent reasoning, not just reward maximization. The paper introduces a supervision-free setup in which all rewards are built into the environment itself, removing the need for curated labels or external verifiers.

The framework instantiates three puzzle environments: Jigsaw, PatchFit, and Rotation. Jigsaw is especially important because it supports partial-credit rewards: the model earns credit for intermediate progress instead of collapsing every rollout to an all-or-nothing outcome.
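As a concrete illustration, the sketch below shows one way a partial-credit jigsaw reward could be scored: the model predicts a permutation of the shuffled tiles, and the reward is the fraction of tiles placed in their correct position. This is a minimal sketch under our own assumptions; the exact reward in the paper may differ, and the function name is hypothetical.

```python
# Minimal sketch of a partial-credit jigsaw reward (illustrative; the paper's
# exact reward formulation may differ, and this function name is hypothetical).

def jigsaw_partial_credit(predicted_order: list[int], true_order: list[int]) -> float:
    """Fraction of tiles the model places in the correct position.

    A fully correct arrangement scores 1.0; a fully wrong one scores 0.0;
    intermediate progress earns proportional credit instead of all-or-nothing.
    """
    if len(predicted_order) != len(true_order) or not true_order:
        return 0.0  # malformed answers receive no credit
    correct = sum(p == t for p, t in zip(predicted_order, true_order))
    return correct / len(true_order)

# Example: 2 of 4 tiles placed correctly -> reward 0.5 rather than 0.
print(jigsaw_partial_credit([0, 2, 1, 3], [0, 1, 2, 3]))  # 0.5
```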

Exploration-aware curriculum

Weights prompts by both difficulty and solution-space dispersion, downweighting prompts whose rollouts have collapsed (see the sketch after these highlights).

Reasoning–Answer Consistency

Tracks whether the chain-of-thought actually supports the final answer.

Lightweight puzzle environments

PatchFit, Rotation, and Jigsaw provide scalable, automatically verifiable rewards.

Transfer beyond puzzles

Downstream gains appear on image and video reasoning benchmarks and across multiple Qwen backbones.
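As referenced in the curriculum highlight above, the snippet below is a rough, hedged sketch of what exploration-aware prompt weighting could look like in a GRPO-style loop: prompts whose rollouts disagree (useful learning signal) are upweighted, while prompts that are trivially easy or hard, or whose rollouts have collapsed to a single answer, are downweighted. The weighting function and constants are our own illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch of exploration-aware prompt weighting (assumed form;
# the paper's actual combination of difficulty and dispersion may differ).
from collections import Counter

def prompt_weight(rewards: list[float], answers: list[str]) -> float:
    """Weight a prompt by difficulty and rollout dispersion.

    - Difficulty: a mean reward near 0 or 1 means the prompt is too hard or too
      easy, so it contributes little gradient signal under GRPO.
    - Dispersion: if all rollouts collapse to the same answer, exploration has
      stalled and the prompt is downweighted.
    """
    n = len(rewards)
    mean_reward = sum(rewards) / n
    difficulty_term = mean_reward * (1.0 - mean_reward)   # peaks at mean reward 0.5
    counts = Counter(answers)
    dispersion_term = 1.0 - max(counts.values()) / n      # 0 when rollouts collapse
    # The 0.5 offset keeps moderately difficult prompts from being zeroed out
    # by the dispersion factor alone (an arbitrary illustrative choice).
    return difficulty_term * (0.5 + dispersion_term)

# Collapsed rollouts (identical answers) get a low weight:
print(prompt_weight([0.5, 0.5, 0.5, 0.5], ["A", "A", "A", "A"]))   # ~0.12
# Diverse rollouts with mixed rewards get a higher weight:
print(prompt_weight([1.0, 0.0, 0.5, 0.25], ["A", "B", "C", "A"]))  # ~0.25
```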

Key numbers from the paper

67.78
Qwen2.5-VL-7B Jigsaw average on 9 image benchmarks
+2.20 over the strongest puzzle baseline (65.58)
53.93
Qwen2.5-VL-7B Jigsaw average on 7 video benchmarks
+1.47 over VisualSphinx (52.46)
60.44
Qwen2.5-VL-3B Mix average on 9 image benchmarks
+2.60 over VLM-R1 and +3.15 over ViGoRL
+2.60 / +2.91 / +1.87
Qwen3-VL Jigsaw gains over 2B / 4B / 8B instruct checkpoints
From the main image benchmark table

Consistency and curriculum matter together

The paper tracks four post-training signals over time: reward variance, reasoning–answer consistency (RAC), response length, and reward score. Vanilla GRPO improves early but shows later consistency drift. The exploration-aware curriculum reduces that decline, and the combination with GRPO-CARE gives the strongest RAC profile overall.
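The paper does not spell out the RAC formula in this summary, so purely as an illustration of the kind of signal being tracked, the sketch below computes a simple consistency proxy over a batch of rollouts: whether the answer stated inside the reasoning trace matches the final answer the model commits to. The extraction heuristic and names are assumptions, not the paper's RAC metric or GRPO-CARE's definition.

```python
# Illustrative consistency proxy over a batch of rollouts (an assumption about
# the general idea; not the paper's exact RAC metric or GRPO-CARE's definition).
import re

def consistency_proxy(reasoning: str, final_answer: str) -> bool:
    """True if the last answer-like statement in the reasoning matches the final answer."""
    mentions = re.findall(r"answer is\s*([A-Za-z0-9]+)", reasoning, flags=re.IGNORECASE)
    return bool(mentions) and mentions[-1].lower() == final_answer.strip().lower()

def batch_rac(rollouts: list[tuple[str, str]]) -> float:
    """Fraction of rollouts whose reasoning supports their final answer."""
    if not rollouts:
        return 0.0
    return sum(consistency_proxy(r, a) for r, a in rollouts) / len(rollouts)

rollouts = [
    ("The top-left patch fits slot 2, so the answer is B.", "B"),
    ("Rotation looks like 90 degrees, so the answer is A.", "C"),  # inconsistent
]
print(batch_rac(rollouts))  # 0.5
```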

On the six-benchmark ablation in Table 1, the full Jigsaw + Curriculum + CARE setup reaches the best average score, 75.70, ahead of Jigsaw alone at 73.44 and ahead of Jigsaw + CARE at 75.25.

Table 1. Consistency ablation on Qwen2.5-VL-7B
Variant MME MMStar POPE MMT CV-Bench MMVP Avg.
Qwen2.5-VL-Instruct 2243 64.67 81.77 59.64 73.62 77.67 72.91
Jigsaw 2340 64.47 85.17 59.96 72.13 75.33 73.44
Jigsaw + CARE 2319 65.80 86.95 61.18 77.76 77.00 75.25
Jigsaw + Curriculum 2365 64.27 84.35 62.62 75.37 74.67 74.30
Jigsaw + Curriculum + CARE 2366 64.60 86.52 62.26 77.63 78.67 75.70
RAC and training dynamics from the paper

Figure 2. RAC rises early and can degrade later under vanilla GRPO; the curriculum and CARE stabilize that trend.

Image benchmark results

All values below are taken directly from the paper's tables. Avg. is the mean over the 9 benchmarks, with MME normalized before averaging (as in the paper).
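For readers who want to reproduce the Avg. column, the sketch below assumes MME is rescaled to a 0–100 range by dividing by 28 (its roughly 2800-point maximum); this assumption reproduces the reported averages, e.g. the Jigsaw row of Table 2a.

```python
# Minimal sketch of the 9-benchmark average, assuming MME is normalized by
# dividing by 28 (this matches the Avg. values reported in the tables).

def image_benchmark_average(scores: dict[str, float]) -> float:
    """Mean over the 9 image benchmarks with MME rescaled to ~0-100."""
    normalized = {k: (v / 28.0 if k == "MME" else v) for k, v in scores.items()}
    return sum(normalized.values()) / len(normalized)

# Example: the Qwen2.5-VL-7B Jigsaw row from Table 2a.
jigsaw_row = {
    "MathVista": 68.20, "MathVision": 30.92, "MathVerse": 56.70,
    "MME": 2366, "MMStar": 64.60, "POPE": 86.52,
    "MMT": 62.26, "CV-Bench": 77.63, "MMVP": 78.67,
}
print(round(image_benchmark_average(jigsaw_row), 2))  # 67.78, matching the table
```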

Qwen2.5-VL-7B

Jigsaw reaches 67.78 Avg., which is +2.20 over Visual Jigsaw and +2.12 over GRPO-CARE.

Qwen2.5-VL-3B

Mix reaches 60.44 Avg., outperforming VLM-R1 by +2.60 and ViGoRL by +3.15.

Qwen3-VL

Jigsaw improves all three tested scales: +2.60 on 2B, +2.91 on 4B, and +1.87 on 8B.

Table 2a. Qwen2.5-VL-7B on 9 image benchmarks
Model MathVista MathVision MathVerse MME MMStar POPE MMT CV-Bench MMVP Avg.
Qwen2.5-VL (Vanilla) 66.30 23.68 56.49 2243 64.67 81.77 59.64 73.62 77.67 64.88
Vision-R1 68.89 39.47 59.99 2292 62.33 88.41 59.70 72.17 77.00 67.76
VL-Rethinker 72.39 29.90 67.29 2311 63.06 83.62 61.01 76.90 77.33 68.23
GRPO-CARE 68.70 20.39 47.71 2352 64.13 88.18 62.62 74.91 80.33 65.66
ViCrit 61.40 18.75 38.02 2167 62.27 80.50 59.83 70.84 75.33 60.48
Vision-Zero 66.20 21.05 45.86 2248 63.47 82.67 60.47 73.53 78.00 63.50
Visual Jigsaw 67.50 29.27 57.50 2243 62.53 84.78 57.76 74.46 76.33 65.58
VisualSphinx 67.80 26.31 53.90 2296 63.20 83.93 60.63 73.98 77.33 65.45
Game-RL 67.40 24.01 58.00 2229 64.60 81.77 61.27 75.24 77.66 65.51
Jigsaw 68.20 30.92 56.70 2366 64.60 86.52 62.26 77.63 78.67 67.78
PatchFit 68.03 26.31 50.89 2316 59.87 85.05 58.94 73.37 78.00 64.80
Rotation 71.70 22.69 57.70 2357 64.60 87.36 61.91 75.08 79.67 67.21
Mix 68.20 24.01 58.10 2359 65.20 85.40 62.65 76.96 78.33 67.01
Table 2b. Qwen2.5-VL-3B on 9 image benchmarks
Model MathVista MathVision MathVerse MME MMStar POPE MMT CV-Bench MMVP Avg.
Qwen2.5-VL (Vanilla) 57.19 21.38 38.30 2180 54.73 77.41 53.31 65.62 63.33 56.57
VLM-R1 57.99 19.73 40.80 2207 55.20 79.66 52.06 66.65 69.67 57.84
ViGoRL 56.10 18.42 34.60 1919 50.46 84.75 54.90 79.21 68.66 57.29
Jigsaw-R1 58.80 22.69 40.30 2184 55.53 78.05 57.53 70.87 69.66 59.05
Jigsaw 58.09 18.42 44.70 2223 55.40 78.68 57.88 71.33 68.67 59.17
Mix 60.60 24.01 49.00 2127 57.53 77.30 57.72 72.88 69.00 60.44
Table 2c. Qwen3-VL on 9 image benchmarks
Model MathVista MathVision MathVerse MME MMStar POPE MMT CV-Bench MMVP Avg.
Qwen3-VL-2B-Instruct 40.69 15.13 28.10 2072 46.60 82.88 53.53 71.36 70.33 53.62
Qwen3-VL-2B-Jigsaw 42.00 19.07 37.50 2076 48.86 84.81 56.53 72.10 71.00 56.22
Qwen3-VL-4B-Instruct 51.63 22.03 44.30 2138 53.40 87.57 58.77 73.98 75.66 60.41
Qwen3-VL-4B-Jigsaw 53.40 28.28 51.90 2207 55.80 87.56 61.24 76.55 76.33 63.32
Qwen3-VL-8B-Instruct 57.40 23.68 47.90 2243 57.46 86.25 59.96 76.84 75.33 62.77
Qwen3-VL-8B-Jigsaw 58.19 28.28 55.20 2227 57.66 86.58 61.72 76.56 78.00 64.64

Puzzle-based RLVR transfers to video benchmarks too

Without using video data during post-training, Qwen2.5-VL-7B Jigsaw reaches 53.93 Avg. on 7 video reasoning benchmarks. That is +1.47 over the strongest puzzle baseline, VisualSphinx at 52.46, and it comes close to the video-supervised Video-R1 score of 55.11.

Table 3. Qwen2.5-VL-7B on 7 video reasoning benchmarks
Method CGBench Video-MMMU MVBench TempCompass Video-MME Video-TT QBench-Video Avg.
Qwen2.5-VL-Instruct 26.71 33.83 55.80 55.00 65.40 34.40 58.50 47.09
Visual Jigsaw 31.56 32.33 55.60 66.26 66.20 37.40 57.48 49.54
VisualSphinx 31.17 43.16 57.40 70.00 72.00 36.40 57.14 52.46
Video-R1 37.00 40.16 63.40 78.20 72.20 39.40 55.44 55.11
ViCrit 35.85 34.00 58.40 54.40 72.20 36.60 56.80 49.75
GRPO-CARE 33.94 34.50 62.00 67.60 62.80 38.80 54.76 50.62
Vision-Zero 27.48 37.50 60.00 51.20 67.80 34.40 58.16 48.07
Jigsaw 36.33 38.00 59.00 75.40 71.40 38.60 58.84 53.93
Mix 34.86 42.66 61.00 71.80 71.00 34.59 56.12 53.14

Direct mode and puzzle transfer

Table 4. Direct inference mode (no explicit CoT), Qwen2.5-VL-7B
Model MME MMStar POPE MMT CV-Bench MMVP Avg.
Qwen2.5-VL-Instruct 2308 63.27 86.37 62.74 76.29 78.33 74.90
ViCrit 2178 64.13 85.96 62.58 76.59 76.33 73.90
Vision-Zero 2306 63.53 85.92 63.06 76.00 77.67 74.76
Visual Jigsaw 2313 64.13 86.37 62.20 76.47 79.00 75.13
VisualSphinx 2341 63.73 86.34 62.65 76.76 77.33 75.07
GRPO-CARE 2355 63.93 86.85 63.58 75.81 78.00 75.38
Jigsaw 2348 64.33 86.05 63.70 76.12 77.33 75.23
Mix 2371 64.73 86.65 64.09 77.36 78.67 76.03
Table 5. Inter-puzzle transfer
Model Jigsaw PatchFit Rotation
Qwen2.5-VL-Instruct 25.59 21.2 53.3
Jigsaw 36.65 21.7 33.9
PatchFit 17.88 62.6 51.8
Rotation 25.55 20.9 70.8
Mix 36.83 48.6 83.2

Mixed-puzzle training gives the strongest overall transfer, lifting Jigsaw by +11.24, PatchFit by +27.4, and Rotation by +29.9 over the Qwen2.5-VL-Instruct baseline.

Main message

The paper's conclusion is simple: puzzle-based RLVR has real headroom, but progress depends less on merely having puzzles and more on training dynamics. Flat prompt weighting wastes compute, collapsed rollouts weaken learning, and reward alone is not a reliable proxy for faithful reasoning. PuzzleCraft addresses these issues with exploration-aware weighting and explicit consistency analysis.

BibTeX

@misc{jeddi2026puzzlecraftexplorationawarecurriculumlearning,
  title={PuzzleCraft: Exploration-Aware Curriculum Learning for Puzzle-Based RLVR in VLMs},
  author={Ahmadreza Jeddi and Hakki Can Karaimer and Hue Nguyen and Zhongling Wang and Ke Zhao and Javad Rajabi and Ran Zhang and Raghav Goyal and Konstantinos G. Derpanis and Babak Taati and Radek Grzeszczuk},
  year={2026},
  eprint={2512.14944},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.14944}
}