Supervision-free RLVR for VLM post-training

PuzzleCraft: Exploration-Aware Curriculum Learning for Puzzle-Based RLVR in VLMs

PuzzleCraft studies how lightweight visual puzzles, exploration-aware curricula, and reasoning–answer consistency can make puzzle-based RLVR more effective for vision–language models.

Ahmadreza Jeddi1,2,3,*, Hakki C. Karaimer1,*, Hue Nguyen1, Zhongling Wang1,**, Ke Zhao1,**, Javad Rajabi1,2,3,**, Ran Zhang1,**, Raghav Goyal1, Konstantinos G. Derpanis1,3,4, Babak Taati2,3, Radek Grzeszczuk1
1AI Center–Toronto, Samsung Electronics 2University of Toronto 3Vector Institute 4York University
* Equal contribution    ** Equal contribution
PuzzleCraft overview figure from the paper

Figure 1. PuzzleCraft uses three automatically verifiable puzzle environments (Jigsaw, PatchFit, and Rotation) and combines curriculum weighting with reasoning–answer consistency analysis.

What changes in PuzzleCraft?

PuzzleCraft frames puzzle-based RLVR as a problem of curriculum design and consistent reasoning, not just reward maximization. The paper introduces a supervision-free setup in which all rewards are built into the environment itself, removing the need for curated labels or external verifiers.

The framework instantiates three puzzle environments: Jigsaw, PatchFit, and Rotation. Jigsaw is especially important because it supports partial-credit rewards: the model earns credit for intermediate progress instead of collapsing every rollout to an all-or-nothing outcome.
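As a concrete illustration, the sketch below shows one way a partial-credit jigsaw reward could be scored: the model predicts a permutation of the shuffled tiles, and the reward is the fraction of tiles placed in their correct position. This is a minimal sketch under our own assumptions; the exact reward in the paper may differ, and the function name is hypothetical.

```python
# Minimal sketch of a partial-credit jigsaw reward (illustrative; the paper's
# exact reward formulation may differ, and this function name is hypothetical).

def jigsaw_partial_credit(predicted_order: list[int], true_order: list[int]) -> float:
    """Fraction of tiles the model places in the correct position.

    A fully correct arrangement scores 1.0; a fully wrong one scores 0.0;
    intermediate progress earns proportional credit instead of all-or-nothing.
    """
    if len(predicted_order) != len(true_order) or not true_order:
        return 0.0  # malformed answers receive no credit
    correct = sum(p == t for p, t in zip(predicted_order, true_order))
    return correct / len(true_order)

# Example: 2 of 4 tiles placed correctly -> reward 0.5 rather than 0.
print(jigsaw_partial_credit([0, 2, 1, 3], [0, 1, 2, 3]))  # 0.5
```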

Exploration-aware curriculum

Weights prompts by both difficulty and solution-space dispersion, downweighting prompts whose rollouts have collapsed (see the sketch after these highlights).

Reasoning–Answer Consistency

Tracks whether the chain-of-thought actually supports the final answer.

Lightweight puzzle environments

PatchFit, Rotation, and Jigsaw provide scalable, automatically verifiable rewards.

Transfer beyond puzzles

Downstream gains appear on image and video reasoning benchmarks and across multiple Qwen backbones.
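As referenced in the curriculum highlight above, the snippet below is a rough, hedged sketch of what exploration-aware prompt weighting could look like in a GRPO-style loop: prompts whose rollouts disagree (useful learning signal) are upweighted, while prompts that are trivially easy or hard, or whose rollouts have collapsed to a single answer, are downweighted. The weighting function and constants are our own illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch of exploration-aware prompt weighting (assumed form;
# the paper's actual combination of difficulty and dispersion may differ).
from collections import Counter

def prompt_weight(rewards: list[float], answers: list[str]) -> float:
    """Weight a prompt by difficulty and rollout dispersion.

    - Difficulty: a mean reward near 0 or 1 means the prompt is too hard or too
      easy, so it contributes little gradient signal under GRPO.
    - Dispersion: if all rollouts collapse to the same answer, exploration has
      stalled and the prompt is downweighted.
    """
    n = len(rewards)
    mean_reward = sum(rewards) / n
    difficulty_term = mean_reward * (1.0 - mean_reward)   # peaks at mean reward 0.5
    counts = Counter(answers)
    dispersion_term = 1.0 - max(counts.values()) / n      # 0 when rollouts collapse
    # The 0.5 offset keeps moderately difficult prompts from being zeroed out
    # by the dispersion factor alone (an arbitrary illustrative choice).
    return difficulty_term * (0.5 + dispersion_term)

# Collapsed rollouts (identical answers) get a low weight:
print(prompt_weight([0.5, 0.5, 0.5, 0.5], ["A", "A", "A", "A"]))   # ~0.12
# Diverse rollouts with mixed rewards get a higher weight:
print(prompt_weight([1.0, 0.0, 0.5, 0.25], ["A", "B", "C", "A"]))  # ~0.25
```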

Key numbers from the paper

67.78
Qwen2.5-VL-7B Jigsaw average on 9 image benchmarks
+2.20 over the strongest puzzle baseline (65.58)
53.93
Qwen2.5-VL-7B Jigsaw average on 7 video benchmarks
+1.47 over VisualSphinx (52.46)
60.44
Qwen2.5-VL-3B Mix average on 9 image benchmarks
+2.60 over VLM-R1 and +3.15 over ViGoRL
+2.60 / +2.91 / +1.87
Qwen3-VL Jigsaw gains over 2B / 4B / 8B instruct checkpoints
From the main image benchmark table

Consistency and curriculum matter together

The paper tracks four post-training signals over time: reward variance, reasoning–answer consistency (RAC), response length, and reward score. Vanilla GRPO improves early but shows later consistency drift. The exploration-aware curriculum reduces that decline, and the combination with GRPO-CARE gives the strongest RAC profile overall.
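The paper does not spell out the RAC formula in this summary, so purely as an illustration of the kind of signal being tracked, the sketch below computes a simple consistency proxy over a batch of rollouts: whether the answer stated inside the reasoning trace matches the final answer the model commits to. The extraction heuristic and names are assumptions, not the paper's RAC metric or GRPO-CARE's definition.

```python
# Illustrative consistency proxy over a batch of rollouts (an assumption about
# the general idea; not the paper's exact RAC metric or GRPO-CARE's definition).
import re

def consistency_proxy(reasoning: str, final_answer: str) -> bool:
    """True if the last answer-like statement in the reasoning matches the final answer."""
    mentions = re.findall(r"answer is\s*([A-Za-z0-9]+)", reasoning, flags=re.IGNORECASE)
    return bool(mentions) and mentions[-1].lower() == final_answer.strip().lower()

def batch_rac(rollouts: list[tuple[str, str]]) -> float:
    """Fraction of rollouts whose reasoning supports their final answer."""
    if not rollouts:
        return 0.0
    return sum(consistency_proxy(r, a) for r, a in rollouts) / len(rollouts)

rollouts = [
    ("The top-left patch fits slot 2, so the answer is B.", "B"),
    ("Rotation looks like 90 degrees, so the answer is A.", "C"),  # inconsistent
]
print(batch_rac(rollouts))  # 0.5
```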

On the six-benchmark ablation in Table 1, the full Jigsaw + Curriculum + CARE setup reaches the best average score, 75.70, ahead of Jigsaw alone at 73.44 and ahead of Jigsaw + CARE at 75.25.

Table 1. Consistency ablation on Qwen2.5-VL-7B
Variant MME MMStar POPE MMT CV-Bench MMVP Avg.
Qwen2.5-VL-Instruct 2243 64.67 81.77 59.64 73.62 77.67 72.91
Jigsaw 2340 64.47 85.17 59.96 72.13 75.33 73.44
Jigsaw + CARE 2319 65.80 86.95 61.18 77.76 77.00 75.25
Jigsaw + Curriculum 2365 64.27 84.35 62.62 75.37 74.67 74.30
Jigsaw + Curriculum + CARE 2366 64.60 86.52 62.26 77.63 78.67 75.70
RAC and training dynamics from the paper

Figure 2. RAC rises early and can degrade later under vanilla GRPO; the curriculum and CARE stabilize that trend.

Image benchmark results

All values below are taken directly from the paper's tables. Avg. is the mean over the 9 benchmarks, with MME normalized before averaging (as in the paper).
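For readers who want to reproduce the Avg. column, the sketch below assumes MME is rescaled to a 0–100 range by dividing by 28 (its roughly 2800-point maximum); this assumption reproduces the reported averages, e.g. the Jigsaw row of Table 2a.

```python
# Minimal sketch of the 9-benchmark average, assuming MME is normalized by
# dividing by 28 (this matches the Avg. values reported in the tables).

def image_benchmark_average(scores: dict[str, float]) -> float:
    """Mean over the 9 image benchmarks with MME rescaled to ~0-100."""
    normalized = {k: (v / 28.0 if k == "MME" else v) for k, v in scores.items()}
    return sum(normalized.values()) / len(normalized)

# Example: the Qwen2.5-VL-7B Jigsaw row from Table 2a.
jigsaw_row = {
    "MathVista": 68.20, "MathVision": 30.92, "MathVerse": 56.70,
    "MME": 2366, "MMStar": 64.60, "POPE": 86.52,
    "MMT": 62.26, "CV-Bench": 77.63, "MMVP": 78.67,
}
print(round(image_benchmark_average(jigsaw_row), 2))  # 67.78, matching the table
```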

Qwen2.5-VL-7B

Jigsaw reaches 67.78 Avg., which is +2.20 over Visual Jigsaw and +2.12 over GRPO-CARE.

Qwen2.5-VL-3B

Mix reaches 60.44 Avg., outperforming VLM-R1 by +2.60 and ViGoRL by +3.15.

Qwen3-VL

Jigsaw improves all three tested scales: +2.60 on 2B, +2.91 on 4B, and +1.87 on 8B.

Table 2a. Qwen2.5-VL-7B on 9 image benchmarks
Model MathVista MathVision MathVerse MME MMStar POPE MMT CV-Bench MMVP Avg.
Qwen2.5-VL (Vanilla) 66.30 23.68 56.49 2243 64.67 81.77 59.64 73.62 77.67 64.88
Vision-R1 68.89 39.47 59.99 2292 62.33 88.41 59.70 72.17 77.00 67.76
VL-Rethinker 72.39 29.90 67.29 2311 63.06 83.62 61.01 76.90 77.33 68.23
GRPO-CARE 68.70 20.39 47.71 2352 64.13 88.18 62.62 74.91 80.33 65.66
ViCrit 61.40 18.75 38.02 2167 62.27 80.50 59.83 70.84 75.33 60.48
Vision-Zero 66.20 21.05 45.86 2248 63.47 82.67 60.47 73.53 78.00 63.50
Visual Jigsaw 67.50 29.27 57.50 2243 62.53 84.78 57.76 74.46 76.33 65.58
VisualSphinx 67.80 26.31 53.90 2296 63.20 83.93 60.63 73.98 77.33 65.45
Game-RL 67.40 24.01 58.00 2229 64.60 81.77 61.27 75.24 77.66 65.51
Jigsaw 68.20 30.92 56.70 2366 64.60 86.52 62.26 77.63 78.67 67.78
PatchFit 68.03 26.31 50.89 2316 59.87 85.05 58.94 73.37 78.00 64.80
Rotation 71.70 22.69 57.70 2357 64.60 87.36 61.91 75.08 79.67 67.21
Mix 68.20 24.01 58.10 2359 65.20 85.40 62.65 76.96 78.33 67.01
Table 2b. Qwen2.5-VL-3B on 9 image benchmarks
Model MathVista MathVision MathVerse MME MMStar POPE MMT CV-Bench MMVP Avg.
Qwen2.5-VL (Vanilla) 57.19 21.38 38.30 2180 54.73 77.41 53.31 65.62 63.33 56.57
VLM-R1 57.99 19.73 40.80 2207 55.20 79.66 52.06 66.65 69.67 57.84
ViGoRL 56.10 18.42 34.60 1919 50.46 84.75 54.90 79.21 68.66 57.29
Jigsaw-R1 58.80 22.69 40.30 2184 55.53 78.05 57.53 70.87 69.66 59.05
Jigsaw 58.09 18.42 44.70 2223 55.40 78.68 57.88 71.33 68.67 59.17
Mix 60.60 24.01 49.00 2127 57.53 77.30 57.72 72.88 69.00 60.44
Table 2c. Qwen3-VL on 9 image benchmarks
Model MathVista MathVision MathVerse MME MMStar POPE MMT CV-Bench MMVP Avg.
Qwen3-VL-2B-Instruct 40.69 15.13 28.10 2072 46.60 82.88 53.53 71.36 70.33 53.62
Qwen3-VL-2B-Jigsaw 42.00 19.07 37.50 2076 48.86 84.81 56.53 72.10 71.00 56.22
Qwen3-VL-4B-Instruct 51.63 22.03 44.30 2138 53.40 87.57 58.77 73.98 75.66 60.41
Qwen3-VL-4B-Jigsaw 53.40 28.28 51.90 2207 55.80 87.56 61.24 76.55 76.33 63.32
Qwen3-VL-8B-Instruct 57.40 23.68 47.90 2243 57.46 86.25 59.96 76.84 75.33 62.77
Qwen3-VL-8B-Jigsaw 58.19 28.28 55.20 2227 57.66 86.58 61.72 76.56 78.00 64.64

Puzzle-based RLVR transfers to video benchmarks too

Without using video data during post-training, Qwen2.5-VL-7B Jigsaw reaches 53.93 Avg. on 7 video reasoning benchmarks. That is +1.47 over the strongest puzzle baseline, VisualSphinx at 52.46, and it comes close to the video-supervised Video-R1 score of 55.11.

Table 3. Qwen2.5-VL-7B on 7 video reasoning benchmarks
Method CGBench Video-MMMU MVBench TempCompass Video-MME Video-TT QBench-Video Avg.
Qwen2.5-VL-Instruct 26.71 33.83 55.80 55.00 65.40 34.40 58.50 47.09
Visual Jigsaw 31.56 32.33 55.60 66.26 66.20 37.40 57.48 49.54
VisualSphinx 31.17 43.16 57.40 70.00 72.00 36.40 57.14 52.46
Video-R1 37.00 40.16 63.40 78.20 72.20 39.40 55.44 55.11
ViCrit 35.85 34.00 58.40 54.40 72.20 36.60 56.80 49.75
GRPO-CARE 33.94 34.50 62.00 67.60 62.80 38.80 54.76 50.62
Vision-Zero 27.48 37.50 60.00 51.20 67.80 34.40 58.16 48.07
Jigsaw 36.33 38.00 59.00 75.40 71.40 38.60 58.84 53.93
Mix 34.86 42.66 61.00 71.80 71.00 34.59 56.12 53.14

Direct mode and puzzle transfer

Table 4. Direct inference mode (no explicit CoT), Qwen2.5-VL-7B
Model MME MMStar POPE MMT CV-Bench MMVP Avg.
Qwen2.5-VL-Instruct 2308 63.27 86.37 62.74 76.29 78.33 74.90
ViCrit 2178 64.13 85.96 62.58 76.59 76.33 73.90
Vision-Zero 2306 63.53 85.92 63.06 76.00 77.67 74.76
Visual Jigsaw 2313 64.13 86.37 62.20 76.47 79.00 75.13
VisualSphinx 2341 63.73 86.34 62.65 76.76 77.33 75.07
GRPO-CARE 2355 63.93 86.85 63.58 75.81 78.00 75.38
Jigsaw 2348 64.33 86.05 63.70 76.12 77.33 75.23
Mix 2371 64.73 86.65 64.09 77.36 78.67 76.03
Table 5. Inter-puzzle transfer
Model Jigsaw PatchFit Rotation
Qwen2.5-VL-Instruct 25.59 21.2 53.3
Jigsaw 36.65 21.7 33.9
PatchFit 17.88 62.6 51.8
Rotation 25.55 20.9 70.8
Mix 36.83 48.6 83.2

Mixed-puzzle training gives the strongest overall transfer, lifting Jigsaw by +11.24, PatchFit by +27.4, and Rotation by +29.9 over the Qwen2.5-VL-Instruct baseline.

Main message

The paper's conclusion is simple: puzzle-based RLVR has real headroom, but progress depends less on merely having puzzles and more on training dynamics. Flat prompt weighting wastes compute, collapsed rollouts weaken learning, and reward alone is not a reliable proxy for faithful reasoning. PuzzleCraft addresses these issues with exploration-aware weighting and explicit consistency analysis.

BibTeX

@misc{jeddi2026puzzlecraftexplorationawarecurriculumlearning,
  title={PuzzleCraft: Exploration-Aware Curriculum Learning for Puzzle-Based RLVR in VLMs},
  author={Ahmadreza Jeddi and Hakki Can Karaimer and Hue Nguyen and Zhongling Wang and Ke Zhao and Javad Rajabi and Ran Zhang and Raghav Goyal and Konstantinos G. Derpanis and Babak Taati and Radek Grzeszczuk},
  year={2026},
  eprint={2512.14944},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.14944}
}