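| Main Authors: | Li, Zinuo; Guo, Yongxin; Liu, Jun; Zhan, Jiawei; Jiang, Xi; Wang, Chengjie; Bennamoun, Mohammed; Boussaid, Farid; Zheng, Feng; Ke, Qiuhong |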
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Computation and Language |
| Online Access: | https://arxiv.org/abs/2604.04415 |
| _version_ | 1866915988696465408 |
|---|---|
| author | Li, Zinuo; Guo, Yongxin; Liu, Jun; Zhan, Jiawei; Jiang, Xi; Wang, Chengjie; Bennamoun, Mohammed; Boussaid, Farid; Zheng, Feng; Ke, Qiuhong |
| author_facet | Li, Zinuo; Guo, Yongxin; Liu, Jun; Zhan, Jiawei; Jiang, Xi; Wang, Chengjie; Bennamoun, Mohammed; Boussaid, Farid; Zheng, Feng; Ke, Qiuhong |
| contents | Human understanding of video dynamics relies on forming structured representations of entities, actions, and temporal relations before engaging in abstract reasoning. In contrast, existing Video-LLMs apply unstructured chain-of-thought directly to raw visual tokens, where critical temporal cues are buried in verbose narration and event-level structure is largely overlooked. We propose Structured Event Evidence, which represents a video as a compact, time-ordered event schema capturing salient events with key attributes and inter-event temporal dependencies, enabling evidence-grounded reasoning through a constrained verification process. This design promotes concise, interpretable reasoning while reducing the drift typical of unconstrained chain-of-thought. To train models under this paradigm, we introduce STEER-60K, a dataset with a four-stage progressive pipeline: evidence training, format warm-start, thinking warm-start, and RL post-training. During RL, CoT length and task accuracy often conflict while rewards for hard samples are too sparse, causing the policy to neglect challenging instances. We formulate this as a multi-objective Pareto optimality problem and propose Pareto-Frontier guided Advantage Balancing (P-FAB), which dynamically resolves reward conflicts and identifies balanced optimization directions along the Pareto frontier. The resulting model, STEER-4B, rivals 7B-scale baselines on video understanding tasks with half the input frames. Code and data will be released. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2604_04415 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | STEER: Structured Event Evidence for Video Reasoning via Multi-Objective Reinforcement Learning |
| topic | Computation and Language |
| url | https://arxiv.org/abs/2604.04415 |