PuzzleCraft studies how lightweight visual puzzles, exploration-aware curricula, and reasoning–answer consistency can make puzzle-based RLVR more effective for vision–language models.
Figure 1. PuzzleCraft uses three automatically verifiable puzzle environments (Jigsaw, PatchFit, and Rotation) and combines curriculum weighting with reasoning–answer consistency analysis.
PuzzleCraft frames puzzle-based RLVR as a problem of curriculum design and consistent reasoning, not just reward maximization. The setup is supervision-free: all rewards are built into the puzzle environments themselves, removing the need for curated labels or external verifiers.
The framework instantiates Jigsaw, PatchFit, and Rotation. Jigsaw is especially important because it supports partial-credit rewards, letting the model get credit for intermediate progress instead of collapsing everything to all-or-nothing outcomes.
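To make the partial-credit idea concrete, here is a minimal sketch of what a verifiable Jigsaw reward could look like; the permutation encoding and exact reward shape are assumptions for illustration, not the paper's implementation.

```python
def jigsaw_reward(predicted: list[int], target: list[int]) -> float:
    """Partial-credit reward for a jigsaw arrangement (illustrative).

    Both arguments encode a patch arrangement as a permutation:
    position i holds the index of the patch placed there. The reward
    is the fraction of patches in the correct position, so intermediate
    progress earns partial credit instead of an all-or-nothing 0/1.
    """
    if len(predicted) != len(target):
        return 0.0  # malformed answer gets no credit
    correct = sum(p == t for p, t in zip(predicted, target))
    return correct / len(target)

# Two of four patches correctly placed -> reward 0.5 rather than 0.
assert jigsaw_reward([0, 1, 3, 2], [0, 1, 2, 3]) == 0.5
```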
In brief:

- **Exploration-aware curriculum:** weights prompts by both difficulty and solution-space dispersion, downweighting prompts whose rollouts collapse to a single outcome (see the sketch after this list).
- **Reasoning–answer consistency (RAC):** tracks whether the chain-of-thought actually supports the final answer.
- **Verifiable environments:** Jigsaw, PatchFit, and Rotation provide scalable, automatically verifiable rewards.
- **Broad transfer:** downstream gains appear on image reasoning, video reasoning, and multiple Qwen backbones.
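A minimal sketch of how the exploration-aware weighting could be computed from per-prompt rollout statistics; the specific weighting function below is illustrative, not the paper's formula.

```python
import numpy as np

def prompt_weight(rollout_rewards: np.ndarray) -> float:
    """Illustrative exploration-aware weight for one prompt.

    `rollout_rewards` holds the verifiable rewards of a GRPO-style
    group of rollouts for the same prompt, assumed to lie in [0, 1].
    Prompts whose rollouts collapse to identical outcomes carry no
    learning signal and get weight 0; dispersed, mid-difficulty
    prompts are weighted up.
    """
    mean = rollout_rewards.mean()                # difficulty proxy
    dispersion = rollout_rewards.std()           # solution-space spread proxy
    difficulty_term = 4.0 * mean * (1.0 - mean)  # peaks at mean reward 0.5
    return difficulty_term * dispersion          # 0 for collapsed rollouts

batch = [np.array([0.0, 0.0, 0.0, 0.0]),        # collapsed rollouts -> 0
         np.array([0.25, 0.50, 0.75, 0.50])]    # dispersed, mid difficulty
weights = np.array([prompt_weight(r) for r in batch])
weights /= weights.sum() + 1e-8                 # normalize across the batch
```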
The paper tracks four post-training signals over time: reward variance, reasoning–answer consistency (RAC), response length, and reward score. Vanilla GRPO improves early but shows later consistency drift. The exploration-aware curriculum reduces that decline, and the combination with GRPO-CARE gives the strongest RAC profile overall.
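A minimal sketch of logging those four signals per batch; `judge_consistent` is a hypothetical stand-in for whatever checker decides that a chain-of-thought supports its final answer.

```python
import numpy as np

def training_signals(rollouts, judge_consistent) -> dict:
    """Compute the four monitored signals for one batch of rollouts.

    Each rollout is assumed to expose a verifiable reward, a reasoning
    trace, and a final answer; `judge_consistent(reasoning, answer)`
    returns True when the trace actually supports the answer.
    """
    rewards = np.array([r.reward for r in rollouts])
    return {
        "reward_score": rewards.mean(),
        "reward_variance": rewards.var(),
        "rac": float(np.mean([judge_consistent(r.reasoning, r.answer)
                              for r in rollouts])),
        "response_length": float(np.mean([len(r.reasoning)
                                          for r in rollouts])),
    }
```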
On the six-benchmark ablation in Table 1, the full Jigsaw + Curriculum + CARE setup reaches the best average score, 75.70, ahead of Jigsaw alone at 73.44 and ahead of Jigsaw + CARE at 75.25.
Table 1. Six-benchmark ablation of the Jigsaw environment, the curriculum, and CARE.

| Variant | MME | MMStar | POPE | MMT | CV-Bench | MMVP | Avg. |
|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-Instruct | 2243 | 64.67 | 81.77 | 59.64 | 73.62 | 77.67 | 72.91 |
| Jigsaw | 2340 | 64.47 | 85.17 | 59.96 | 72.13 | 75.33 | 73.44 |
| Jigsaw + CARE | 2319 | 65.80 | 86.95 | 61.18 | 77.76 | 77.00 | 75.25 |
| Jigsaw + Curriculum | 2365 | 64.27 | 84.35 | 62.62 | 75.37 | 74.67 | 74.30 |
| Jigsaw + Curriculum + CARE | 2366 | 64.60 | 86.52 | 62.26 | 77.63 | 78.67 | 75.70 |
Figure 2. RAC rises early and can degrade later under vanilla GRPO; the curriculum and CARE stabilize that trend.
All values below are copied from the paper's tables. Avg. is the mean over the 9 benchmarks, with MME normalized before averaging, as in the paper.
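The reported averages are consistent with rescaling MME by its 2800-point full score before taking the plain mean; that normalization is inferred from the numbers, not stated here. A quick check against the Jigsaw row:

```python
# Scores from the Jigsaw row of the 9-benchmark table below.
jigsaw = {"MathVista": 68.20, "MathVision": 30.92, "MathVerse": 56.70,
          "MME": 2366, "MMStar": 64.60, "POPE": 86.52, "MMT": 62.26,
          "CV-Bench": 77.63, "MMVP": 78.67}

def avg_with_mme_normalized(row: dict, mme_max: float = 2800.0) -> float:
    # Rescale MME to a 0-100 range, then average all nine benchmarks.
    scores = [v / mme_max * 100.0 if k == "MME" else v
              for k, v in row.items()]
    return sum(scores) / len(scores)

print(round(avg_with_mme_normalized(jigsaw), 2))  # 67.78, matching the table
```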
- In the first table, Jigsaw reaches 67.78 Avg., +2.20 over Visual Jigsaw and +2.12 over GRPO-CARE.
- In the second table (a smaller backbone), Mix reaches 60.44 Avg., outperforming VLM-R1 by +2.60 and ViGoRL by +3.15.
- In the third table (Qwen3-VL), Jigsaw improves all three tested scales: +2.60 at 2B, +2.91 at 4B, and +1.87 at 8B.
| Model | MathVista | MathVision | MathVerse | MME | MMStar | POPE | MMT | CV-Bench | MMVP | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL (Vanilla) | 66.30 | 23.68 | 56.49 | 2243 | 64.67 | 81.77 | 59.64 | 73.62 | 77.67 | 64.88 |
| Vision-R1 | 68.89 | 39.47 | 59.99 | 2292 | 62.33 | 88.41 | 59.70 | 72.17 | 77.00 | 67.76 |
| VL-Rethinker | 72.39 | 29.90 | 67.29 | 2311 | 63.06 | 83.62 | 61.01 | 76.90 | 77.33 | 68.23 |
| GRPO-CARE | 68.70 | 20.39 | 47.71 | 2352 | 64.13 | 88.18 | 62.62 | 74.91 | 80.33 | 65.66 |
| ViCrit | 61.40 | 18.75 | 38.02 | 2167 | 62.27 | 80.50 | 59.83 | 70.84 | 75.33 | 60.48 |
| Vision-Zero | 66.20 | 21.05 | 45.86 | 2248 | 63.47 | 82.67 | 60.47 | 73.53 | 78.00 | 63.50 |
| Visual Jigsaw | 67.50 | 29.27 | 57.50 | 2243 | 62.53 | 84.78 | 57.76 | 74.46 | 76.33 | 65.58 |
| VisualSphinx | 67.80 | 26.31 | 53.90 | 2296 | 63.20 | 83.93 | 60.63 | 73.98 | 77.33 | 65.45 |
| Game-RL | 67.40 | 24.01 | 58.00 | 2229 | 64.60 | 81.77 | 61.27 | 75.24 | 77.66 | 65.51 |
| Jigsaw | 68.20 | 30.92 | 56.70 | 2366 | 64.60 | 86.52 | 62.26 | 77.63 | 78.67 | 67.78 |
| PatchFit | 68.03 | 26.31 | 50.89 | 2316 | 59.87 | 85.05 | 58.94 | 73.37 | 78.00 | 64.80 |
| Rotation | 71.70 | 22.69 | 57.70 | 2357 | 64.60 | 87.36 | 61.91 | 75.08 | 79.67 | 67.21 |
| Mix | 68.20 | 24.01 | 58.10 | 2359 | 65.20 | 85.40 | 62.65 | 76.96 | 78.33 | 67.01 |
| Model | MathVista | MathVision | MathVerse | MME | MMStar | POPE | MMT | CV-Bench | MMVP | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL (Vanilla) | 57.19 | 21.38 | 38.30 | 2180 | 54.73 | 77.41 | 53.31 | 65.62 | 63.33 | 56.57 |
| VLM-R1 | 57.99 | 19.73 | 40.80 | 2207 | 55.20 | 79.66 | 52.06 | 66.65 | 69.67 | 57.84 |
| ViGoRL | 56.10 | 18.42 | 34.60 | 1919 | 50.46 | 84.75 | 54.90 | 79.21 | 68.66 | 57.29 |
| Jigsaw-R1 | 58.80 | 22.69 | 40.30 | 2184 | 55.53 | 78.05 | 57.53 | 70.87 | 69.66 | 59.05 |
| Jigsaw | 58.09 | 18.42 | 44.70 | 2223 | 55.40 | 78.68 | 57.88 | 71.33 | 68.67 | 59.17 |
| Mix | 60.60 | 24.01 | 49.00 | 2127 | 57.53 | 77.30 | 57.72 | 72.88 | 69.00 | 60.44 |
| Model | MathVista | MathVision | MathVerse | MME | MMStar | POPE | MMT | CV-Bench | MMVP | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-VL-2B-Instruct | 40.69 | 15.13 | 28.10 | 2072 | 46.60 | 82.88 | 53.53 | 71.36 | 70.33 | 53.62 |
| Qwen3-VL-2B-Jigsaw | 42.00 | 19.07 | 37.50 | 2076 | 48.86 | 84.81 | 56.53 | 72.10 | 71.00 | 56.22 |
| Qwen3-VL-4B-Instruct | 51.63 | 22.03 | 44.30 | 2138 | 53.40 | 87.57 | 58.77 | 73.98 | 75.66 | 60.41 |
| Qwen3-VL-4B-Jigsaw | 53.40 | 28.28 | 51.90 | 2207 | 55.80 | 87.56 | 61.24 | 76.55 | 76.33 | 63.32 |
| Qwen3-VL-8B-Instruct | 57.40 | 23.68 | 47.90 | 2243 | 57.46 | 86.25 | 59.96 | 76.84 | 75.33 | 62.77 |
| Qwen3-VL-8B-Jigsaw | 58.19 | 28.28 | 55.20 | 2227 | 57.66 | 86.58 | 61.72 | 76.56 | 78.00 | 64.64 |
Without using video data during post-training, Qwen2.5-VL-7B Jigsaw reaches 53.93 Avg. on 7 video reasoning benchmarks. That is +1.47 over the strongest puzzle baseline, VisualSphinx at 52.46, and it comes close to the video-supervised Video-R1 score of 55.11.
| Method | CGBench | Video-MMMU | MVBench | TempCompass | Video-MME | Video-TT | QBench-Video | Avg. |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-Instruct | 26.71 | 33.83 | 55.80 | 55.00 | 65.40 | 34.40 | 58.50 | 47.09 |
| Visual Jigsaw | 31.56 | 32.33 | 55.60 | 66.26 | 66.20 | 37.40 | 57.48 | 49.54 |
| VisualSphinx | 31.17 | 43.16 | 57.40 | 70.00 | 72.00 | 36.40 | 57.14 | 52.46 |
| Video-R1 | 37.00 | 40.16 | 63.40 | 78.20 | 72.20 | 39.40 | 55.44 | 55.11 |
| ViCrit | 35.85 | 34.00 | 58.40 | 54.40 | 72.20 | 36.60 | 56.80 | 49.75 |
| GRPO-CARE | 33.94 | 34.50 | 62.00 | 67.60 | 62.80 | 38.80 | 54.76 | 50.62 |
| Vision-Zero | 27.48 | 37.50 | 60.00 | 51.20 | 67.80 | 34.40 | 58.16 | 48.07 |
| Jigsaw | 36.33 | 38.00 | 59.00 | 75.40 | 71.40 | 38.60 | 58.84 | 53.93 |
| Mix | 34.86 | 42.66 | 61.00 | 71.80 | 71.00 | 34.59 | 56.12 | 53.14 |
In a further comparison on the same six benchmarks, Mix posts the best average (76.03), ahead of GRPO-CARE (75.38) and Jigsaw (75.23):

| Model | MME | MMStar | POPE | MMT | CV-Bench | MMVP | Avg. |
|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-Instruct | 2308 | 63.27 | 86.37 | 62.74 | 76.29 | 78.33 | 74.90 |
| ViCrit | 2178 | 64.13 | 85.96 | 62.58 | 76.59 | 76.33 | 73.90 |
| Vision-Zero | 2306 | 63.53 | 85.92 | 63.06 | 76.00 | 77.67 | 74.76 |
| Visual Jigsaw | 2313 | 64.13 | 86.37 | 62.20 | 76.47 | 79.00 | 75.13 |
| VisualSphinx | 2341 | 63.73 | 86.34 | 62.65 | 76.76 | 77.33 | 75.07 |
| GRPO-CARE | 2355 | 63.93 | 86.85 | 63.58 | 75.81 | 78.00 | 75.38 |
| Jigsaw | 2348 | 64.33 | 86.05 | 63.70 | 76.12 | 77.33 | 75.23 |
| Mix | 2371 | 64.73 | 86.65 | 64.09 | 77.36 | 78.67 | 76.03 |
In-domain puzzle accuracy, with training setups as rows and the puzzle being evaluated as columns:

| Model | Jigsaw | PatchFit | Rotation |
|---|---|---|---|
| Qwen2.5-VL-Instruct | 25.59 | 21.2 | 53.3 |
| Jigsaw | 36.65 | 21.7 | 33.9 |
| PatchFit | 17.88 | 62.6 | 51.8 |
| Rotation | 25.55 | 20.9 | 70.8 |
| Mix | 36.83 | 48.6 | 83.2 |
Mixed-puzzle training gives the strongest overall transfer, lifting Jigsaw by +11.24, PatchFit by +27.4, and Rotation by +29.9 over the Qwen2.5-VL-Instruct baseline. Single-puzzle training, by contrast, can hurt off-task accuracy: Jigsaw training drops Rotation from 53.3 to 33.9.
The paper's conclusion is simple: puzzle-based RLVR has real headroom, but progress depends less on merely having puzzles and more on training dynamics. Flat weighting wastes compute, collapsed rollouts weaken learning, and reward alone is not a reliable proxy for faithful reasoning. PuzzleCraft improves this with exploration-aware weighting and explicit consistency analysis.
@misc{jeddi2026puzzlecraftexplorationawarecurriculumlearning,
title={PuzzleCraft: Exploration-Aware Curriculum Learning for Puzzle-Based RLVR in VLMs},
author={Ahmadreza Jeddi and Hakki Can Karaimer and Hue Nguyen and Zhongling Wang and Ke Zhao and Javad Rajabi and Ran Zhang and Raghav Goyal and Konstantinos G. Derpanis and Babak Taati and Radek Grzeszczuk},
year={2026},
eprint={2512.14944},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.14944},
}