Staff View: Thinking with Frames: Generative Video Distortion Evaluation via Frame Reward Model :: Library Catalog

Bibliographic Details
Main Authors:	Wang, Yuan; Liao, Borui; Huang, Huijuan; Lu, Jinda; Li, Ouxiang; Liu, Kuien; Wang, Meng; Wang, Xiang
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2601.04033

_version_	1866911545780338688
author	Wang, Yuan; Liao, Borui; Huang, Huijuan; Lu, Jinda; Li, Ouxiang; Liu, Kuien; Wang, Meng; Wang, Xiang
author_facet	Wang, Yuan; Liao, Borui; Huang, Huijuan; Lu, Jinda; Li, Ouxiang; Liu, Kuien; Wang, Meng; Wang, Xiang
contents	Recent advances in video reward models and post-training strategies have improved text-to-video (T2V) generation. While these models typically assess visual quality, motion quality, and text alignment, they often overlook key structural distortions, such as abnormal object appearances and interactions, which can degrade the overall quality of the generative video. To address this gap, we introduce REACT, a frame-level reward model designed specifically for structural distortion evaluation in generative videos. REACT assigns point-wise scores and attribution labels by reasoning over video frames, focusing on recognizing distortions. To support this, we construct a large-scale human preference dataset, annotated based on our proposed taxonomy of structural distortions, and generate additional data using an efficient Chain-of-Thought (CoT) synthesis pipeline. REACT is trained with a two-stage framework: (1) supervised fine-tuning with masked loss for domain knowledge injection, followed by (2) reinforcement learning with Group Relative Policy Optimization (GRPO) and pairwise rewards to enhance reasoning capability and align output scores with human preferences. During inference, a dynamic sampling mechanism is introduced to focus on frames most likely to exhibit distortion. We also present REACT-Bench, a benchmark for generative video distortion evaluation. Experimental results demonstrate that REACT complements existing reward models in assessing structural distortion, achieving both accurate quantitative evaluations and interpretable attribution analysis.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_04033
institution	arXiv
publishDate	2026
record_format	arxiv
title	Thinking with Frames: Generative Video Distortion Evaluation via Frame Reward Model
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2601.04033

Similar Items