:: Library Catalog

Bibliographic Details
Main Authors:	Cheng, Tongtong; Li, Rongzhen; Xiong, Yixin; Zhang, Tao; Wang, Jing; Liu, Kai
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2507.06072

Similar Items

Other Vehicle Trajectories Are Also Needed: A Driving World Model Unifies Ego-Other Vehicle Trajectories in Video Latent Space
by: Zhu, Jian, et al.
Published: (2025)

Abductive Ego-View Accident Video Understanding for Safe Driving Perception
by: Fang, Jianwu, et al.
Published: (2024)

FlowAD: Ego-Scene Interactive Modeling for Autonomous Driving
by: Guo, Mingzhe, et al.
Published: (2026)

Exo2Ego: Exocentric Knowledge Guided MLLM for Egocentric Video Understanding
by: Zhang, Haoyu, et al.
Published: (2025)

Understanding Attention Mechanism in Video Diffusion Models
by: Liu, Bingyan, et al.
Published: (2025)

EgoSound: Benchmarking Sound Understanding in Egocentric Videos
by: Zhu, Bingwen, et al.
Published: (2026)

EgoVLM: Policy Optimization for Egocentric Video Understanding
by: Vinod, Ashwin, et al.
Published: (2025)

EgoDyn-Bench: Evaluating Ego-Motion Understanding in Vision-Centric Foundation Models for Autonomous Driving
by: Schäfer, Finn Rasmus, et al.
Published: (2026)

EgoExo-Gen: Ego-centric Video Prediction by Watching Exo-centric Videos
by: Xu, Jilan, et al.
Published: (2025)

Intention-driven Ego-to-Exo Video Generation
by: Luo, Hongchen, et al.
Published: (2024)

Eyes Wide Open: Ego Proactive Video-LLM for Streaming Video
by: Zhang, Yulin, et al.
Published: (2025)

EgoGraph: Temporal Knowledge Graph for Egocentric Video Understanding
by: Sun, Shitong, et al.
Published: (2026)

EgoExo-WM: Unlocking Exo Video for Ego World Models
by: Tran, Danny, et al.
Published: (2026)

VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI
by: Cheng, Sijie, et al.
Published: (2024)

Predicting Video Slot Attention Queries from Random Slot-Feature Pairs
by: Zhao, Rongzhen, et al.
Published: (2025)

Internalizing Temporal Consistency in Video Object-Centric Learning without Explicit Regularization
by: Zhao, Rongzhen, et al.
Published: (2026)

UniDrive-WM: Unified Understanding, Planning and Generation World Model For Autonomous Driving
by: Xiong, Zhexiao, et al.
Published: (2026)

Causality Model for Semantic Understanding on Videos
by: Yicong, Li
Published: (2025)

Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation
by: Gao, Hongcheng, et al.
Published: (2025)

EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning
by: Chen, Yi, et al.
Published: (2023)

EgoAVU: Egocentric Audio-Visual Understanding
by: Seth, Ashish, et al.
Published: (2026)

EgoGaussian: Dynamic Scene Understanding from Egocentric Video with 3D Gaussian Splatting
by: Zhang, Daiwei, et al.
Published: (2024)

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
by: Zhang, Boqiang, et al.
Published: (2025)

Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding
by: Nagrani, Arsha, et al.
Published: (2026)

EgoInteract: Synthetic Egocentric Videos Generation for Interaction Understanding and Anticipation
by: Leonardi, Rosario, et al.
Published: (2026)

Cycle Consistency in Video Object-Centric Learning
by: Zhao, Rongzhen, et al.
Published: (2026)

Drive-JEPA: Video JEPA Meets Multimodal Trajectory Distillation for End-to-End Driving
by: Wang, Linhan, et al.
Published: (2026)

SurveillanceVQA-589K: A Benchmark for Comprehensive Surveillance Video-Language Understanding with Large Models
by: Liu, Bo, et al.
Published: (2025)

EgoExo-Con: Exploring View-Invariant Video Temporal Understanding
by: Jung, Minjoon, et al.
Published: (2025)

Understanding Long Videos with Multimodal Language Models
by: Ranasinghe, Kanchana, et al.
Published: (2024)

ST-Think: How Multimodal Large Language Models Reason About 4D Worlds from Ego-Centric Videos
by: Wu, Peiran, et al.
Published: (2025)

Hier-EgoPack: Hierarchical Egocentric Video Understanding with Diverse Task Perspectives
by: Peirone, Simone Alberto, et al.
Published: (2025)

Ego-InBetween: Generating Object State Transitions in Ego-Centric Videos
by: Ge, Mengmeng, et al.
Published: (2026)

PathMMU: A Massive Multimodal Expert-Level Benchmark for Understanding and Reasoning in Pathology
by: Sun, Yuxuan, et al.
Published: (2024)

Ego-VPA: Egocentric Video Understanding with Parameter-efficient Adaptation
by: Wu, Tz-Ying, et al.
Published: (2024)

Quantifying and Mitigating Unimodal Biases in Multimodal Large Language Models: A Causal Perspective
by: Chen, Meiqi, et al.
Published: (2024)

InternVideo2: Scaling Foundation Models for Multimodal Video Understanding
by: Wang, Yi, et al.
Published: (2024)

CauCLIP: Bridging the Sim-to-Real Gap in Surgical Video Understanding via Causality-Inspired Vision-Language Modeling
by: He, Yuxin, et al.
Published: (2026)

EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation
by: Pei, Baoqi, et al.
Published: (2024)

MM-Ego: Towards Building Egocentric Multimodal LLMs for Video QA
by: Ye, Hanrong, et al.
Published: (2024)