| Main Authors: | Yu, Xueqing; Li, Bohan; Li, Yan; Yang, Zhenheng |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2603.07071 |
Similar Items
LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs
by: Tao, Keda, et al.
Published: (2026)
UniCode$^2$: Cascaded Large-scale Codebooks for Unified Multimodal Understanding and Generation
by: Chen, Yanzhe, et al.
Published: (2025)
FOCUS: Efficient Keyframe Selection for Long Video Understanding
by: Zhu, Zirui, et al.
Published: (2025)
KFS-Bench: Comprehensive Evaluation of Key Frame Sampling in Long Video Understanding
by: Li, Zongyao, et al.
Published: (2025)
LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding
by: Wu, Haoning, et al.
Published: (2024)
LongVLM: Efficient Long Video Understanding via Large Language Models
by: Weng, Yuetian, et al.
Published: (2024)
Long Context Tuning for Video Generation
by: Guo, Yuwei, et al.
Published: (2025)
TextVidBench: A Benchmark for Long Video Scene Text Understanding
by: Zhong, Yangyang, et al.
Published: (2025)
COEF-VQ: Cost-Efficient Video Quality Understanding through a Cascaded Multimodal LLM Framework
by: Dong, Xin, et al.
Published: (2024)
LongInsightBench: A Comprehensive Benchmark for Evaluating Omni-Modal Models on Human-Centric Long-Video Understanding
by: Han, ZhaoYang, et al.
Published: (2025)
VidText: Towards Comprehensive Evaluation for Video Text Understanding
by: Yang, Zhoufaran, et al.
Published: (2025)
ReactBench: A Cause-Driven Benchmark for Multimodal Hallucination via Systematic Evaluation
by: Zhou, Shizhe, et al.
Published: (2026)
Video-XL-Pro: Reconstructive Token Compression for Extremely Long Video Understanding
by: Liu, Xiangrui, et al.
Published: (2025)
Progressive Video Condensation with MLLM Agent for Long-form Video Understanding
by: Yin, Yufei, et al.
Published: (2026)
Benchmarking the Trustworthiness in Multimodal LLMs for Video Understanding
by: Wang, Youze, et al.
Published: (2025)
VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation
by: Ma, Wentao, et al.
Published: (2025)
RoboTrustBench: Benchmarking the Trustworthiness of Video World Models for Robotic Manipulation
by: Li, Huiqiong, et al.
Published: (2026)
VideoAgent2: Enhancing the LLM-Based Agent System for Long-Form Video Understanding by Uncertainty-Aware CoT
by: Zhi, Zhuo, et al.
Published: (2025)
Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs
by: Zhang, Zicheng, et al.
Published: (2024)
X-LeBench: A Benchmark for Extremely Long Egocentric Video Understanding
by: Zhou, Wenqi, et al.
Published: (2025)
CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding
by: Chen, Guo, et al.
Published: (2024)
Artifact-Bench: Evaluating MLLMs on Detecting and Assessing the Artifacts of AI-Generated Videos
by: Tang, Yuqi, et al.
Published: (2026)
TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos
by: Feng, Hengyi, et al.
Published: (2026)
Video-SafetyBench: A Benchmark for Safety Evaluation of Video LVLMs
by: Liu, Xuannan, et al.
Published: (2025)
LvBench: A Benchmark for Long-form Video Understanding with Versatile Multi-modal Question Answering
by: Zhang, Hongjie, et al.
Published: (2023)
CueBench: Advancing Unified Understanding of Context-Aware Video Anomalies in Real-World
by: Yu, Yating, et al.
Published: (2025)
MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations
by: Ma, Yubo, et al.
Published: (2024)
Active Perception Agent for Omnimodal Audio-Video Understanding
by: Tao, Keda, et al.
Published: (2025)
VideoPro: Adaptive Program Reasoning for Long Video Understanding
by: Li, Chenglin, et al.
Published: (2025)
Video Token Merging for Long-form Video Understanding
by: Lee, Seon-Ho, et al.
Published: (2024)
ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation
by: Athar, Ali, et al.
Published: (2024)
VEU-Bench: Towards Comprehensive Understanding of Video Editing
by: Li, Bozheng, et al.
Published: (2025)
InstructionBench: An Instructional Video Understanding Benchmark
by: Wei, Haiwan, et al.
Published: (2025)
Understanding Long Videos with Multimodal Language Models
by: Ranasinghe, Kanchana, et al.
Published: (2024)
Long Video Understanding with Learnable Retrieval in Video-Language Models
by: Xu, Jiaqi, et al.
Published: (2023)
VideoRewardBench: Comprehensive Evaluation of Multimodal Reward Models for Video Understanding
by: Zhang, Zhihong, et al.
Published: (2025)
RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video
by: Xun, Shuhang, et al.
Published: (2025)
MR. Video: "MapReduce" is the Principle for Long Video Understanding
by: Pang, Ziqi, et al.
Published: (2025)
DocScope: Benchmarking Verifiable Reasoning for Trustworthy Long-Document Understanding
by: Feng, Xiang, et al.
Published: (2026)
STORM: Token-Efficient Long Video Understanding for Multimodal LLMs
by: Jiang, Jindong, et al.
Published: (2025)