Library Catalog

Bibliographic Details
Main Authors:	Feng, Bo; Lai, Zhengfeng; Li, Shiyu; Wang, Zizhen; Wang, Simon; Huang, Ping; Cao, Meng
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2505.14321

Similar Items

Incentivizing Temporal-Awareness in Egocentric Video Understanding Models
by: Xu, Zhiyang, et al.
Published: (2026)

StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant
by: Wang, Haibo, et al.
Published: (2025)

SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding
by: Xu, Mingze, et al.
Published: (2025)

ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding
by: Wang, Xiao, et al.
Published: (2024)

VideoLoom: A Video Large Language Model for Joint Spatial-Temporal Understanding
by: Shi, Jiapeng, et al.
Published: (2026)

Thinking with Drafts: Speculative Temporal Reasoning for Efficient Long Video Understanding
by: Hu, Pengfei, et al.
Published: (2025)

Taming Text-to-Sounding Video Generation via Advanced Modality Condition and Interaction
by: Guan, Kaisi, et al.
Published: (2025)

Dr.V: A Hierarchical Perception-Temporal-Cognition Framework to Diagnose Video Hallucination by Fine-grained Spatial-Temporal Grounding
by: Luo, Meng, et al.
Published: (2025)

ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering
by: Guan, Kaisi, et al.
Published: (2025)

MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
by: Hong, Wenyi, et al.
Published: (2025)

VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
by: Yuan, Yuqian, et al.
Published: (2024)

Seq2Time: Sequential Knowledge Transfer for Video LLM Temporal Grounding
by: Deng, Andong, et al.
Published: (2024)

EgoGraph: Temporal Knowledge Graph for Egocentric Video Understanding
by: Sun, Shitong, et al.
Published: (2026)

LVBench: An Extreme Long Video Understanding Benchmark
by: Wang, Weihan, et al.
Published: (2024)

PosMLP-Video: Spatial and Temporal Relative Position Encoding for Efficient Video Recognition
by: Hao, Yanbin, et al.
Published: (2024)

MVPBench: A Multi-Video Perception Evaluation Benchmark for Multi-Modal Video Understanding
by: Bai, Purui, et al.
Published: (2026)

VCBench: A Streaming Counting Benchmark for Spatial-Temporal State Maintenance in Long Videos
by: Liu, Pengyiang, et al.
Published: (2026)

TUMTraffic-VideoQA: A Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes
by: Zhou, Xingcheng, et al.
Published: (2025)

VideoCompressa: Data-Efficient Video Understanding via Joint Temporal Compression and Spatial Reconstruction
by: Wang, Shaobo, et al.
Published: (2025)

VideoExpert: Augmented LLM for Temporal-Sensitive Video Understanding
by: Zhao, Henghao, et al.
Published: (2025)

V-CORE: Temporally Consistent Video Understanding for Video-LLM
by: Kang, Zhengjian, et al.
Published: (2026)

SlowFocus: Enhancing Fine-grained Temporal Understanding in Video LLM
by: Nie, Ming, et al.
Published: (2026)

Benchmarking the Trustworthiness in Multimodal LLMs for Video Understanding
by: Wang, Youze, et al.
Published: (2025)

E-VAds: An E-commerce Short Videos Understanding Benchmark for MLLMs
by: Liu, Xianjie, et al.
Published: (2026)

VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs
by: Liao, Ruotong, et al.
Published: (2024)

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
by: Li, Kunchang, et al.
Published: (2023)

STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding
by: Liu, Zichen, et al.
Published: (2025)

TV-RAG: A Temporal-aware and Semantic Entropy-Weighted Framework for Long Video Retrieval and Understanding
by: Cao, Zongsheng, et al.
Published: (2025)

Efficient Video Face Enhancement with Enhanced Spatial-Temporal Consistency
by: Wang, Yutong, et al.
Published: (2024)

Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations
by: Wang, Yuji, et al.
Published: (2025)

VTok: A Unified Video Tokenizer with Decoupled Spatial-Temporal Latents
by: Wang, Feng, et al.
Published: (2026)

VSAS-Bench: Real-Time Evaluation of Visual Streaming Assistant Models
by: Vasu, Pavan Kumar Anasosalu, et al.
Published: (2026)

EVA02-AT: Egocentric Video-Language Understanding with Spatial-Temporal Rotary Positional Embeddings and Symmetric Optimization
by: Wang, Xiaoqi, et al.
Published: (2025)

RS3DBench: A Comprehensive Benchmark for 3D Spatial Perception in Remote Sensing
by: Wang, Jiayu, et al.
Published: (2025)

Active Perception Agent for Omnimodal Audio-Video Understanding
by: Tao, Keda, et al.
Published: (2025)

Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models
by: Wang, Haibo, et al.
Published: (2024)

VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
by: Li, Xinhao, et al.
Published: (2025)

Benchmarking Video Frame Interpolation
by: Kiefhaber, Simon, et al.
Published: (2024)

Breaking Down Monocular Ambiguity: Exploiting Temporal Evolution for 3D Lane Detection
by: Zheng, Huan, et al.
Published: (2025)

VERIFIED: A Video Corpus Moment Retrieval Benchmark for Fine-Grained Video Understanding
by: Chen, Houlun, et al.
Published: (2024)