Staff View: Video Models Can Reason with Verifiable Rewards :: Library Catalog

Bibliographic Details
Main Authors:	Zhu, Tinghui; Zhang, Sheng; Huang, James Y.; Song, Selena; Wen, Xiaofei; Li, Yuankai; Poon, Hoifung; Chen, Muhao
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2605.15458

_version_	1866910223414853632
author	Zhu, Tinghui; Zhang, Sheng; Huang, James Y.; Song, Selena; Wen, Xiaofei; Li, Yuankai; Poon, Hoifung; Chen, Muhao
author_facet	Zhu, Tinghui; Zhang, Sheng; Huang, James Y.; Song, Selena; Wen, Xiaofei; Li, Yuankai; Poon, Hoifung; Chen, Muhao
contents	Video diffusion models have made rapid progress in perceptual realism and temporal coherence, but they remain primarily optimized for plausible generation rather than verifiable reasoning. This limitation is especially pronounced in tasks where generated videos must satisfy explicit spatial, temporal, or logical constraints. Inspired by the role of reinforcement learning with verifiable rewards (RLVR) in reasoning-oriented language models, we introduce VideoRLVR, a practical recipe for optimizing video diffusion models with rule-based feedback. VideoRLVR formulates video reasoning as the generation of verifiable visual trajectories and consists of an SDE-GRPO optimization backbone, dense decomposed rewards, and an Early-Step Focus strategy for efficient training. The Early-Step Focus strategy restricts policy optimization to the early denoising phase, reducing training latency by about 40% while preserving performance. We evaluate VideoRLVR on Maze, FlowFree, and Sokoban, three procedurally generated domains with objective success criteria. Across these tasks, VideoRLVR consistently improves over supervised fine-tuning baselines, with dense decomposed rewards proving especially important in low-success-rate settings. Our RL-optimized model also outperforms the evaluated proprietary and open-source video generation models on these verifiable reasoning benchmarks and out-of-domain benchmarks. These results suggest that verifiable RL can move video models beyond perceptual imitation toward more reliable rule-consistent visual reasoning.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_15458
institution	arXiv
publishDate	2026
record_format	arxiv
title	Video Models Can Reason with Verifiable Rewards
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2605.15458

Similar Items