Staff View: STEER: Structured Event Evidence for Video Reasoning via Multi-Objective Reinforcement Learning :: Library Catalog

Bibliographic Details
Main Authors:	Li, Zinuo; Guo, Yongxin; Liu, Jun; Zhan, Jiawei; Jiang, Xi; Wang, Chengjie; Bennamoun, Mohammed; Boussaid, Farid; Zheng, Feng; Ke, Qiuhong
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2604.04415

_version_	1866915988696465408
author	Li, Zinuo; Guo, Yongxin; Liu, Jun; Zhan, Jiawei; Jiang, Xi; Wang, Chengjie; Bennamoun, Mohammed; Boussaid, Farid; Zheng, Feng; Ke, Qiuhong
author_facet	Li, Zinuo; Guo, Yongxin; Liu, Jun; Zhan, Jiawei; Jiang, Xi; Wang, Chengjie; Bennamoun, Mohammed; Boussaid, Farid; Zheng, Feng; Ke, Qiuhong
contents	Human understanding of video dynamics relies on forming structured representations of entities, actions, and temporal relations before engaging in abstract reasoning. In contrast, existing Video-LLMs apply unstructured chain-of-thought directly to raw visual tokens, where critical temporal cues are buried in verbose narration and event-level structure is largely overlooked. We propose Structured Event Evidence, which represents a video as a compact, time-ordered event schema capturing salient events with key attributes and inter-event temporal dependencies, enabling evidence-grounded reasoning through a constrained verification process. This design promotes concise, interpretable reasoning while reducing the drift typical of unconstrained chain-of-thought. To train models under this paradigm, we introduce STEER-60K, a dataset with a four-stage progressive pipeline: evidence training, format warm-start, thinking warm-start, and RL post-training. During RL, CoT length and task accuracy often conflict while rewards for hard samples are too sparse, causing the policy to neglect challenging instances. We formulate this as a multi-objective Pareto optimality problem and propose Pareto-Frontier guided Advantage Balancing (P-FAB), which dynamically resolves reward conflicts and identifies balanced optimization directions along the Pareto frontier. The resulting model STEER-4B rivals 7B-scale baselines on video understanding tasks with half the input frames. Code and data will be released.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_04415
institution	arXiv
publishDate	2026
record_format	arxiv
title	STEER: Structured Event Evidence for Video Reasoning via Multi-Objective Reinforcement Learning
topic	Computation and Language
url	https://arxiv.org/abs/2604.04415