:: Library Catalog

Bibliographic Details
Main Authors:	Sun, Pengzhan; Xiao, Junbin; Tse, Tze Ho Elden; Li, Yicong; Akula, Arjun; Yao, Angela
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2504.13621

Similar Items

Improving Human Motion Plausibility with Body Momentum
by: Nguyen, Ha Linh, et al.
Published: (2025)

DAS3R: Dynamics-Aware Gaussian Splatting for Static Scene Reconstruction
by: Xu, Kai, et al.
Published: (2024)

Can I Trust Your Answer? Visually Grounded Video Question Answering
by: Xiao, Junbin, et al.
Published: (2023)

A Constrained Optimization Approach for Gaussian Splatting from Coarsely-posed Images and Noisy Lidar Point Clouds
by: Peng, Jizong, et al.
Published: (2025)

Humans as Checkerboards: Calibrating Camera Motion Scale for World-Coordinate Human Mesh Recovery
by: Yang, Fengyuan, et al.
Published: (2024)

Leveraging RGB Images for Pre-Training of Event-Based Hand Pose Estimation
by: Liu, Ruicong, et al.
Published: (2025)

TIGeR: Text-Instructed Generation and Refinement for Template-Free Hand-Object Interaction
by: Huang, Yiyao, et al.
Published: (2025)

Ego-Grounding for Personalized Question-Answering in Egocentric Videos
by: Xiao, Junbin, et al.
Published: (2026)

SA-GS: Semantic-Aware Gaussian Splatting for Large Scene Reconstruction with Geometry Constrain
by: Xiong, Butian, et al.
Published: (2024)

EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering
by: Zhou, Sheng, et al.
Published: (2025)

Collaborative Learning for 3D Hand-Object Reconstruction and Compositional Action Recognition from Egocentric RGB Videos Using Superquadrics
by: Tse, Tze Ho Elden, et al.
Published: (2025)

GeoReF: Geometric Alignment Across Shape Variation for Category-level Object Pose Refinement
by: Zheng, Linfang, et al.
Published: (2024)

High-Resolution Spatiotemporal Modeling with Global-Local State Space Models for Video-Based Human Pose Estimation
by: Feng, Runyang, et al.
Published: (2025)

Ego2World: Compiling Egocentric Cooking Videos into Executable Worlds for Belief-State Planning
by: Cheng, Qinchuan, et al.
Published: (2026)

EgoBlind: Towards Egocentric Visual Assistance for the Blind
by: Xiao, Junbin, et al.
Published: (2025)

Analyzing the Synthetic-to-Real Domain Gap in 3D Hand Pose Estimation
by: Zhao, Zhuoran, et al.
Published: (2025)

Question-Answering Dense Video Events
by: Qin, Hangyu, et al.
Published: (2024)

TeleEgo: Benchmarking Egocentric AI Assistants in the Wild
by: Yan, Jiaqi, et al.
Published: (2025)

On the Consistency of Video Large Language Models in Temporal Comprehension
by: Jung, Minjoon, et al.
Published: (2024)

EgoLife: Towards Egocentric Life Assistant
by: Yang, Jingkang, et al.
Published: (2025)

Causal-Entity Reflected Egocentric Traffic Accident Video Synthesis
by: Li, Lei-lei, et al.
Published: (2025)

Scene-Text Grounding for Text-Based Video Question Answering
by: Zhou, Sheng, et al.
Published: (2024)

MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering
by: Xiao, Junbin, et al.
Published: (2026)

VPG: Visual Prefix Guidance for Autoregressive Image and Video Generation
by: Liao, Xinyao, et al.
Published: (2026)

Some Modalities are More Equal Than Others: Decoding and Architecting Multimodal Integration in MLLMs
by: Chen, Tianle, et al.
Published: (2025)

Building Egocentric Procedural AI Assistant: Methods, Benchmarks, and Challenges
by: Li, Junlong, et al.
Published: (2025)

VideoQA in the Era of LLMs: An Empirical Study
by: Xiao, Junbin, et al.
Published: (2024)

Intention-Conditioned Long-Term Human Egocentric Action Forecasting
by: Mascaro, Esteve Valls, et al.
Published: (2022)

EgoSelf: From Memory to Personalized Egocentric Assistant
by: Wang, Yanshuo, et al.
Published: (2026)

EgoExo-Con: Exploring View-Invariant Video Temporal Understanding
by: Jung, Minjoon, et al.
Published: (2025)

ALGO: Object-Grounded Visual Commonsense Reasoning for Open-World Egocentric Action Recognition
by: Kundu, Sanjoy, et al.
Published: (2024)

REAR: Rethinking Visual Autoregressive Models via Generator-Tokenizer Consistency Regularization
by: He, Qiyuan, et al.
Published: (2025)

Grounded Question-Answering in Long Egocentric Videos
by: Di, Shangzhe, et al.
Published: (2023)

AnchorDS: Anchoring Dynamic Sources for Semantically Consistent Text-to-3D Generation
by: Zhu, Jiayin, et al.
Published: (2025)

Simultaneous Detection and Interaction Reasoning for Object-Centric Action Recognition
by: Li, Xunsong, et al.
Published: (2024)

Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model
by: Huang, Yifei, et al.
Published: (2024)

HRVDA: High-Resolution Visual Document Assistant
by: Liu, Chaohu, et al.
Published: (2024)

Fine-grained Spatiotemporal Grounding on Egocentric Videos
by: Liang, Shuo, et al.
Published: (2025)

Discovering Novel Actions from Open World Egocentric Videos with Object-Grounded Visual Commonsense Reasoning
by: Kundu, Sanjoy, et al.
Published: (2023)

WHOLE: World-Grounded Hand-Object Lifted from Egocentric Videos
by: Ye, Yufei, et al.
Published: (2026)