Bibliographic Details
Main Authors: Wang, Yuan, Liao, Borui, Huang, Huijuan, Lu, Jinda, Li, Ouxiang, Liu, Kuien, Wang, Meng, Wang, Xiang
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2601.04033
_version_ 1866911545780338688
author Wang, Yuan
Liao, Borui
Huang, Huijuan
Lu, Jinda
Li, Ouxiang
Liu, Kuien
Wang, Meng
Wang, Xiang
author_facet Wang, Yuan
Liao, Borui
Huang, Huijuan
Lu, Jinda
Li, Ouxiang
Liu, Kuien
Wang, Meng
Wang, Xiang
contents Recent advances in video reward models and post-training strategies have improved text-to-video (T2V) generation. While these models typically assess visual quality, motion quality, and text alignment, they often overlook key structural distortions, such as abnormal object appearances and interactions, which can degrade the overall quality of the generative video. To address this gap, we introduce REACT, a frame-level reward model designed specifically for structural distortion evaluation in generative videos. REACT assigns point-wise scores and attribution labels by reasoning over video frames, focusing on recognizing distortions. To support this, we construct a large-scale human preference dataset, annotated based on our proposed taxonomy of structural distortions, and generate additional data using an efficient Chain-of-Thought (CoT) synthesis pipeline. REACT is trained with a two-stage framework: (1) supervised fine-tuning with masked loss for domain knowledge injection, followed by (2) reinforcement learning with Group Relative Policy Optimization (GRPO) and pairwise rewards to enhance reasoning capability and align output scores with human preferences. During inference, a dynamic sampling mechanism is introduced to focus on frames most likely to exhibit distortion. We also present REACT-Bench, a benchmark for generative video distortion evaluation. Experimental results demonstrate that REACT complements existing reward models in assessing structural distortion, achieving both accurate quantitative evaluations and interpretable attribution analysis.
format Preprint
id arxiv_https___arxiv_org_abs_2601_04033
institution arXiv
publishDate 2026
record_format arxiv
spellingShingle Thinking with Frames: Generative Video Distortion Evaluation via Frame Reward Model
Wang, Yuan
Liao, Borui
Huang, Huijuan
Lu, Jinda
Li, Ouxiang
Liu, Kuien
Wang, Meng
Wang, Xiang
Computer Vision and Pattern Recognition
Recent advances in video reward models and post-training strategies have improved text-to-video (T2V) generation. While these models typically assess visual quality, motion quality, and text alignment, they often overlook key structural distortions, such as abnormal object appearances and interactions, which can degrade the overall quality of the generative video. To address this gap, we introduce REACT, a frame-level reward model designed specifically for structural distortion evaluation in generative videos. REACT assigns point-wise scores and attribution labels by reasoning over video frames, focusing on recognizing distortions. To support this, we construct a large-scale human preference dataset, annotated based on our proposed taxonomy of structural distortions, and generate additional data using an efficient Chain-of-Thought (CoT) synthesis pipeline. REACT is trained with a two-stage framework: (1) supervised fine-tuning with masked loss for domain knowledge injection, followed by (2) reinforcement learning with Group Relative Policy Optimization (GRPO) and pairwise rewards to enhance reasoning capability and align output scores with human preferences. During inference, a dynamic sampling mechanism is introduced to focus on frames most likely to exhibit distortion. We also present REACT-Bench, a benchmark for generative video distortion evaluation. Experimental results demonstrate that REACT complements existing reward models in assessing structural distortion, achieving both accurate quantitative evaluations and interpretable attribution analysis.
title Thinking with Frames: Generative Video Distortion Evaluation via Frame Reward Model
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2601.04033