Table of Contents: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Li, Zinuo, Guo, Yongxin, Liu, Jun, Zhan, Jiawei, Jiang, Xi, Wang, Chengjie, Bennamoun, Mohammed, Boussaid, Farid, Zheng, Feng, Ke, Qiuhong
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2604.04415
Tags:	Add Tag No Tags, Be the first to tag this record!

Table of Contents:

Human understanding of video dynamics relies on forming structured representations of entities, actions, and temporal relations before engaging in abstract reasoning. In contrast, existing Video-LLMs apply unstructured chain-of-thought directly to raw visual tokens, where critical temporal cues are buried in verbose narration and event-level structure is largely overlooked. We propose Structured Event Evidence, which represents a video as a compact, time-ordered event schema capturing salient events with key attributes and inter-event temporal dependencies, enabling evidence-grounded reasoning through a constrained verification process. This design promotes concise, interpretable reasoning while reducing the drift typical of unconstrained chain-of-thought. To train models under this paradigm, we introduce STEER-60K, a dataset with a four-stage progressive pipeline: evidence training, format warm-start, thinking warm-start, and RL post-training. During RL, CoT length and task accuracy often conflict while rewards for hard samples are too sparse, causing the policy to neglect challenging instances. We formulate this as a multi-objective Pareto optimality problem and propose Pareto-Frontier guided Advantage Balancing (P-FAB), which dynamically resolves reward conflicts and identifies balanced optimization directions along the Pareto frontier. The resulting model STEER-4B rivals 7B-scale baselines on video understanding tasks with half the input frames Code and data will be released.

Similar Items