:: Library Catalog

Bibliographic Details
Main Authors:	Xu, Zhiyang; Qin, Tian; Jin, Bowen; Lai, Zhengfeng; Cao, Meng; Huang, Lifu; Zhang, Peng
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2603.27184

Similar Items

Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or True Temporal Understanding?
by: Feng, Bo, et al.
Published: (2025)

SPARTUN3D: Situated Spatial Understanding of 3D World in Large Language Models
by: Zhang, Yue, et al.
Published: (2024)

Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models
by: Wang, Haibo, et al.
Published: (2024)

Do Egocentric Video-Language Models Truly Understand Hand-Object Interactions?
by: Xu, Boshen, et al.
Published: (2024)

EgoGraph: Temporal Knowledge Graph for Egocentric Video Understanding
by: Sun, Shitong, et al.
Published: (2026)

SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding
by: Xu, Mingze, et al.
Published: (2025)

AR-RAG: Autoregressive Retrieval Augmentation for Image Generation
by: Qi, Jingyuan, et al.
Published: (2025)

Think, Act, Build: An Agentic Framework with Vision Language Models for Zero-Shot 3D Visual Grounding
by: Wang, Haibo, et al.
Published: (2026)

ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering
by: Guan, Kaisi, et al.
Published: (2025)

EgoDTM: Towards 3D-Aware Egocentric Video-Language Pretraining
by: Xu, Boshen, et al.
Published: (2025)

Taming Text-to-Sounding Video Generation via Advanced Modality Condition and Interaction
by: Guan, Kaisi, et al.
Published: (2025)

SuperFlow: Training Flow Matching Models with RL on the Fly
by: Chen, Kaijie, et al.
Published: (2025)

ANT: Adaptive Neural Temporal-Aware Text-to-Motion Model
by: Chen, Wenshuo, et al.
Published: (2025)

EgoLCD: Egocentric Video Generation with Long Context Diffusion
by: Zhang, Liuzhou, et al.
Published: (2025)

StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant
by: Wang, Haibo, et al.
Published: (2025)

Exo2Ego: Exocentric Knowledge Guided MLLM for Egocentric Video Understanding
by: Zhang, Haoyu, et al.
Published: (2025)

TEMPLE: Incentivizing Temporal Understanding of Video Large Language Models via Progressive Pre-SFT Alignment
by: Li, Shicheng, et al.
Published: (2025)

EgoEsportsQA: An Egocentric Video Benchmark for Perception and Reasoning in Esports
by: Ma, Jianzhe, et al.
Published: (2026)

Multimodal Instruction Tuning with Conditional Mixture of LoRA
by: Shen, Ying, et al.
Published: (2024)

Retrieval-Augmented Egocentric Video Captioning
by: Xu, Jilan, et al.
Published: (2024)

Omnia de EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos
by: Plizzari, Chiara, et al.
Published: (2025)

In the Eye of MLLM: Benchmarking Egocentric Video Intent Understanding with Gaze-Guided Prompting
by: Peng, Taiying, et al.
Published: (2025)

X-LeBench: A Benchmark for Extremely Long Egocentric Video Understanding
by: Zhou, Wenqi, et al.
Published: (2025)

Egocentric Visibility-Aware Human Pose Estimation
by: Dai, Peng, et al.
Published: (2026)

EyeCue: Driver Cognitive Distraction Detection via Gaze-Empowered Egocentric Video Understanding
by: Zhang, Lang, et al.
Published: (2026)

MADiff: Motion-Aware Mamba Diffusion Models for Hand Trajectory Prediction on Egocentric Videos
by: Ma, Junyi, et al.
Published: (2024)

EgoSound: Benchmarking Sound Understanding in Egocentric Videos
by: Zhu, Bingwen, et al.
Published: (2026)

EgoReAct: Egocentric Video-Driven 3D Human Reaction Generation
by: Zhang, Libo, et al.
Published: (2025)

MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding
by: Luo, Fuwen, et al.
Published: (2025)

Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation
by: Xu, Zhiyang, et al.
Published: (2025)

STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding
by: Liu, Zichen, et al.
Published: (2025)

EgoLoc: A Generalizable Solution for Temporal Interaction Localization in Egocentric Videos
by: Ma, Junyi, et al.
Published: (2025)

Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
by: Jin, Peng, et al.
Published: (2023)

AdsQA: Towards Advertisement Video Understanding
by: Long, Xinwei, et al.
Published: (2025)

EGOILLUSION: Benchmarking Hallucinations in Egocentric Video Understanding
by: Seth, Ashish, et al.
Published: (2025)

Exploring Audio Hallucination in Egocentric Video Understanding
by: Seth, Ashish, et al.
Published: (2026)

EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation
by: Pei, Baoqi, et al.
Published: (2024)

Zero-Shot Temporal Interaction Localization for Egocentric Videos
by: Zhang, Erhang, et al.
Published: (2025)

Med3D-R1: Incentivizing Clinical Reasoning in 3D Medical Vision-Language Models for Abnormality Diagnosis
by: Lai, Haoran, et al.
Published: (2026)

Towards Stable Self-Supervised Object Representations in Unconstrained Egocentric Video
by: Tan, Yuting, et al.
Published: (2026)