| Main Authors: | Feng, Bo, Lai, Zhengfeng, Li, Shiyu, Wang, Zizhen, Wang, Simon, Huang, Ping, Cao, Meng |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2505.14321 |
Similar Items
Incentivizing Temporal-Awareness in Egocentric Video Understanding Models
by: Xu, Zhiyang, et al.
Published: (2026)
StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant
by: Wang, Haibo, et al.
Published: (2025)
SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding
by: Xu, Mingze, et al.
Published: (2025)
ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding
by: Wang, Xiao, et al.
Published: (2024)
VideoLoom: A Video Large Language Model for Joint Spatial-Temporal Understanding
by: Shi, Jiapeng, et al.
Published: (2026)
Thinking with Drafts: Speculative Temporal Reasoning for Efficient Long Video Understanding
by: Hu, Pengfei, et al.
Published: (2025)
Taming Text-to-Sounding Video Generation via Advanced Modality Condition and Interaction
by: Guan, Kaisi, et al.
Published: (2025)
Dr.V: A Hierarchical Perception-Temporal-Cognition Framework to Diagnose Video Hallucination by Fine-grained Spatial-Temporal Grounding
by: Luo, Meng, et al.
Published: (2025)
ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering
by: Guan, Kaisi, et al.
Published: (2025)
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
by: Hong, Wenyi, et al.
Published: (2025)
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
by: Yuan, Yuqian, et al.
Published: (2024)
Seq2Time: Sequential Knowledge Transfer for Video LLM Temporal Grounding
by: Deng, Andong, et al.
Published: (2024)
EgoGraph: Temporal Knowledge Graph for Egocentric Video Understanding
by: Sun, Shitong, et al.
Published: (2026)
LVBench: An Extreme Long Video Understanding Benchmark
by: Wang, Weihan, et al.
Published: (2024)
PosMLP-Video: Spatial and Temporal Relative Position Encoding for Efficient Video Recognition
by: Hao, Yanbin, et al.
Published: (2024)
MVPBench: A Multi-Video Perception Evaluation Benchmark for Multi-Modal Video Understanding
by: Bai, Purui, et al.
Published: (2026)
VCBench: A Streaming Counting Benchmark for Spatial-Temporal State Maintenance in Long Videos
by: Liu, Pengyiang, et al.
Published: (2026)
TUMTraffic-VideoQA: A Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes
by: Zhou, Xingcheng, et al.
Published: (2025)
VideoCompressa: Data-Efficient Video Understanding via Joint Temporal Compression and Spatial Reconstruction
by: Wang, Shaobo, et al.
Published: (2025)
VideoExpert: Augmented LLM for Temporal-Sensitive Video Understanding
by: Zhao, Henghao, et al.
Published: (2025)
V-CORE: Temporally Consistent Video Understanding for Video-LLM
by: Kang, Zhengjian, et al.
Published: (2026)
SlowFocus: Enhancing Fine-grained Temporal Understanding in Video LLM
by: Nie, Ming, et al.
Published: (2026)
Benchmarking the Trustworthiness in Multimodal LLMs for Video Understanding
by: Wang, Youze, et al.
Published: (2025)
E-VAds: An E-commerce Short Videos Understanding Benchmark for MLLMs
by: Liu, Xianjie, et al.
Published: (2026)
VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs
by: Liao, Ruotong, et al.
Published: (2024)
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
by: Li, Kunchang, et al.
Published: (2023)
STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding
by: Liu, Zichen, et al.
Published: (2025)
TV-RAG: A Temporal-aware and Semantic Entropy-Weighted Framework for Long Video Retrieval and Understanding
by: Cao, Zongsheng, et al.
Published: (2025)
Efficient Video Face Enhancement with Enhanced Spatial-Temporal Consistency
by: Wang, Yutong, et al.
Published: (2024)
Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations
by: Wang, Yuji, et al.
Published: (2025)
VTok: A Unified Video Tokenizer with Decoupled Spatial-Temporal Latents
by: Wang, Feng, et al.
Published: (2026)
VSAS-Bench: Real-Time Evaluation of Visual Streaming Assistant Models
by: Vasu, Pavan Kumar Anasosalu, et al.
Published: (2026)
EVA02-AT: Egocentric Video-Language Understanding with Spatial-Temporal Rotary Positional Embeddings and Symmetric Optimization
by: Wang, Xiaoqi, et al.
Published: (2025)
RS3DBench: A Comprehensive Benchmark for 3D Spatial Perception in Remote Sensing
by: Wang, Jiayu, et al.
Published: (2025)
Active Perception Agent for Omnimodal Audio-Video Understanding
by: Tao, Keda, et al.
Published: (2025)
Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models
by: Wang, Haibo, et al.
Published: (2024)
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
by: Li, Xinhao, et al.
Published: (2025)
Benchmarking Video Frame Interpolation
by: Kiefhaber, Simon, et al.
Published: (2024)
Breaking Down Monocular Ambiguity: Exploiting Temporal Evolution for 3D Lane Detection
by: Zheng, Huan, et al.
Published: (2025)
VERIFIED: A Video Corpus Moment Retrieval Benchmark for Fine-Grained Video Understanding
by: Chen, Houlun, et al.
Published: (2024)