Staff View: TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action :: Library Catalog

Bibliographic Details
Main Authors:	Cheng, Jen-Hao; Wang, Vivian; Wang, Huayu; Zhou, Huapeng; Peng, Yi-Hao; Liu, Hou-I; Huang, Hsiang-Wei; Chen, Kuang-Ming; Yang, Cheng-Yen; Chai, Wenhao; Chen, Yi-Ling; Vineet, Vibhav; Cai, Qin; Hwang, Jenq-Neng
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2505.01583

_version_	1866909599736528896
author	Cheng, Jen-Hao; Wang, Vivian; Wang, Huayu; Zhou, Huapeng; Peng, Yi-Hao; Liu, Hou-I; Huang, Hsiang-Wei; Chen, Kuang-Ming; Yang, Cheng-Yen; Chai, Wenhao; Chen, Yi-Ling; Vineet, Vibhav; Cai, Qin; Hwang, Jenq-Neng
author_facet	Cheng, Jen-Hao; Wang, Vivian; Wang, Huayu; Zhou, Huapeng; Peng, Yi-Hao; Liu, Hou-I; Huang, Hsiang-Wei; Chen, Kuang-Ming; Yang, Cheng-Yen; Chai, Wenhao; Chen, Yi-Ling; Vineet, Vibhav; Cai, Qin; Hwang, Jenq-Neng
contents	Understanding causal event relationships and achieving fine-grained temporal grounding in videos remain challenging for vision-language models. Existing methods either compress video tokens to reduce temporal resolution, or treat videos as unsegmented streams, which obscures fine-grained event boundaries and limits the modeling of causal dependencies. We propose TEMPURA (Temporal Event Masked Prediction and Understanding for Reasoning in Action), a two-stage training framework that enhances video temporal understanding. TEMPURA first applies masked event prediction reasoning to reconstruct missing events and generate step-by-step causal explanations from dense event annotations, drawing inspiration from effective infilling techniques. TEMPURA then learns to perform video segmentation and dense captioning to decompose videos into non-overlapping events with detailed, timestamp-aligned descriptions. We train TEMPURA on VER, a large-scale dataset curated by us that comprises 1M training instances and 500K videos with temporally aligned event descriptions and structured reasoning steps. Experiments on temporal grounding and highlight detection benchmarks demonstrate that TEMPURA outperforms strong baseline models, confirming that integrating causal reasoning with fine-grained temporal segmentation leads to improved video understanding.
format	Preprint
id	arxiv_https___arxiv_org_abs_2505_01583
institution	arXiv
publishDate	2025
record_format	arxiv
title	TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action
topic	Computer Vision and Pattern Recognition; Artificial Intelligence
url	https://arxiv.org/abs/2505.01583
