| Main Authors: | Xu, Zhiyang; Qin, Tian; Jin, Bowen; Lai, Zhengfeng; Cao, Meng; Huang, Lifu; Zhang, Peng |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2603.27184 |
Similar Items
Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or True Temporal Understanding?
by: Feng, Bo, et al.
Published: (2025)
SPARTUN3D: Situated Spatial Understanding of 3D World in Large Language Models
by: Zhang, Yue, et al.
Published: (2024)
Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models
by: Wang, Haibo, et al.
Published: (2024)
Do Egocentric Video-Language Models Truly Understand Hand-Object Interactions?
by: Xu, Boshen, et al.
Published: (2024)
EgoGraph: Temporal Knowledge Graph for Egocentric Video Understanding
by: Sun, Shitong, et al.
Published: (2026)
SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding
by: Xu, Mingze, et al.
Published: (2025)
AR-RAG: Autoregressive Retrieval Augmentation for Image Generation
by: Qi, Jingyuan, et al.
Published: (2025)
Think, Act, Build: An Agentic Framework with Vision Language Models for Zero-Shot 3D Visual Grounding
by: Wang, Haibo, et al.
Published: (2026)
ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering
by: Guan, Kaisi, et al.
Published: (2025)
EgoDTM: Towards 3D-Aware Egocentric Video-Language Pretraining
by: Xu, Boshen, et al.
Published: (2025)
Taming Text-to-Sounding Video Generation via Advanced Modality Condition and Interaction
by: Guan, Kaisi, et al.
Published: (2025)
SuperFlow: Training Flow Matching Models with RL on the Fly
by: Chen, Kaijie, et al.
Published: (2025)
ANT: Adaptive Neural Temporal-Aware Text-to-Motion Model
by: Chen, Wenshuo, et al.
Published: (2025)
EgoLCD: Egocentric Video Generation with Long Context Diffusion
by: Zhang, Liuzhou, et al.
Published: (2025)
StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant
by: Wang, Haibo, et al.
Published: (2025)
Exo2Ego: Exocentric Knowledge Guided MLLM for Egocentric Video Understanding
by: Zhang, Haoyu, et al.
Published: (2025)
TEMPLE: Incentivizing Temporal Understanding of Video Large Language Models via Progressive Pre-SFT Alignment
by: Li, Shicheng, et al.
Published: (2025)
EgoEsportsQA: An Egocentric Video Benchmark for Perception and Reasoning in Esports
by: Ma, Jianzhe, et al.
Published: (2026)
Multimodal Instruction Tuning with Conditional Mixture of LoRA
by: Shen, Ying, et al.
Published: (2024)
Retrieval-Augmented Egocentric Video Captioning
by: Xu, Jilan, et al.
Published: (2024)
Omnia de EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos
by: Plizzari, Chiara, et al.
Published: (2025)
In the Eye of MLLM: Benchmarking Egocentric Video Intent Understanding with Gaze-Guided Prompting
by: Peng, Taiying, et al.
Published: (2025)
X-LeBench: A Benchmark for Extremely Long Egocentric Video Understanding
by: Zhou, Wenqi, et al.
Published: (2025)
Egocentric Visibility-Aware Human Pose Estimation
by: Dai, Peng, et al.
Published: (2026)
EyeCue: Driver Cognitive Distraction Detection via Gaze-Empowered Egocentric Video Understanding
by: Zhang, Lang, et al.
Published: (2026)
MADiff: Motion-Aware Mamba Diffusion Models for Hand Trajectory Prediction on Egocentric Videos
by: Ma, Junyi, et al.
Published: (2024)
EgoSound: Benchmarking Sound Understanding in Egocentric Videos
by: Zhu, Bingwen, et al.
Published: (2026)
EgoReAct: Egocentric Video-Driven 3D Human Reaction Generation
by: Zhang, Libo, et al.
Published: (2025)
MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding
by: Luo, Fuwen, et al.
Published: (2025)
Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation
by: Xu, Zhiyang, et al.
Published: (2025)
STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding
by: Liu, Zichen, et al.
Published: (2025)
EgoLoc: A Generalizable Solution for Temporal Interaction Localization in Egocentric Videos
by: Ma, Junyi, et al.
Published: (2025)
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
by: Jin, Peng, et al.
Published: (2023)
AdsQA: Towards Advertisement Video Understanding
by: Long, Xinwei, et al.
Published: (2025)
EGOILLUSION: Benchmarking Hallucinations in Egocentric Video Understanding
by: Seth, Ashish, et al.
Published: (2025)
Exploring Audio Hallucination in Egocentric Video Understanding
by: Seth, Ashish, et al.
Published: (2026)
EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation
by: Pei, Baoqi, et al.
Published: (2024)
Zero-Shot Temporal Interaction Localization for Egocentric Videos
by: Zhang, Erhang, et al.
Published: (2025)
Med3D-R1: Incentivizing Clinical Reasoning in 3D Medical Vision-Language Models for Abnormality Diagnosis
by: Lai, Haoran, et al.
Published: (2026)
Towards Stable Self-Supervised Object Representations in Unconstrained Egocentric Video
by: Tan, Yuting, et al.
Published: (2026)