Bibliographic Details
Main Authors: Wang, Yuan; Li, Ouxiang; Xu, Yulong; Liao, Borui; Liang, Jiajun; Li, Jinghan; Wang, Meng; Wang, Xintao; Wan, Pengfei; Liu, Kuien; Wang, Xiang
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2605.05922
_version_ 1866909034970349568
author Wang, Yuan
Li, Ouxiang
Xu, Yulong
Liao, Borui
Liang, Jiajun
Li, Jinghan
Wang, Meng
Wang, Xintao
Wan, Pengfei
Liu, Kuien
Wang, Xiang
contents Recent advances in generative video models are increasingly driven by post-training and test-time scaling, both of which critically depend on the quality of video reward models (RMs). An ideal reward model should predict accurate rewards that align with human preferences across diverse scenarios. However, existing paradigms face a fundamental dilemma: Discriminative RMs regress rewards directly on features extracted by multimodal large language models (MLLMs) without explicit reasoning, making them prone to shortcut learning and heavily reliant on massive data scaling for generalization. In contrast, Generative RMs with Chain-of-Thought (CoT) reasoning exhibit superior interpretability and generalization potential, as they leverage fine-grained semantic supervision to internalize the rationales behind human preferences. However, they suffer from inherent optimization bottlenecks due to the coupling of reasoning and scoring within a single autoregressive inference chain. To harness the generalization benefits of CoT reasoning while mitigating the training instability of coupled reasoning and scoring, we introduce DeScore, a training-efficient and generalizable video reward model. DeScore employs a decoupled "think-then-score" paradigm: an MLLM first generates an explicit CoT, followed by a dedicated discriminative scoring module consisting of a learnable query token and a regression head that predicts the final reward. DeScore is optimized via a two-stage framework: (1) a discriminative cold start incorporating a random mask mechanism to ensure robust scoring capabilities, and (2) a dual-objective reinforcement learning stage that independently refines CoT reasoning quality and calibrates the final reward, ensuring that higher-quality reasoning directly translates to superior model performance.
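The decoupled scoring step described in the abstract (a learnable query token followed by a regression head, with random masking during the cold start) can be pictured with a small sketch. This is not the authors' implementation. Everything in it is an assumption made for illustration: the hidden states are random stand-ins for MLLM outputs, the query is modeled as a vector that attention-pools those states, the regression head is a single linear layer, and the mask simply drops token states at random. The names `random_mask` and `score` are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def random_mask(hidden_states, drop_prob, rng):
    """Assumed masking scheme: drop each token state with probability
    drop_prob, always keeping at least one state."""
    kept = [h for h in hidden_states if rng.random() >= drop_prob]
    return kept if kept else [rng.choice(hidden_states)]


def score(hidden_states, query, head_weights, head_bias):
    """Pool the token states with the query vector (scaled dot-product
    attention), then map the pooled vector to a scalar with a linear head."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, h) / scale for h in hidden_states])
    pooled = [
        sum(w * h[i] for w, h in zip(weights, hidden_states))
        for i in range(len(query))
    ]
    return dot(head_weights, pooled) + head_bias


rng = random.Random(0)
dim = 4

# Stand-in for the MLLM hidden states of the video and generated CoT tokens.
states = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(6)]

# Parameters that would be learned in practice; random here.
query = [rng.uniform(-1, 1) for _ in range(dim)]
head_w = [rng.uniform(-1, 1) for _ in range(dim)]
head_b = 0.0

reward_full = score(states, query, head_w, head_b)
reward_masked = score(random_mask(states, 0.5, rng), query, head_w, head_b)
```

In this toy setup the reward is read from a separate regression path, not decoded as text, so the scalar can be trained with a regression or ranking loss independently of how the reasoning tokens are produced. That separation is the point the abstract makes about decoupling reasoning from scoring.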
format Preprint
id arxiv_https___arxiv_org_abs_2605_05922
institution arXiv
publishDate 2026
record_format arxiv
title Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2605.05922