:: Library Catalog

Bibliographic Details
Main Author:	Zhong, Yutong
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2510.17034

Similar Items

Learning Pyramid-structured Long-range Dependencies for 3D Human Pose Estimation
by: Wei, Mingjie, et al.
Published: (2025)

Where, What, Why: Toward Explainable 3D-GS Watermarking
by: Cai, Mingshu, et al.
Published: (2026)

From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs
by: Zhang, Le, et al.
Published: (2026)

LookWhen? Fast Video Recognition by Learning When, Where, and What to Compute
by: Salamatian, Ali, et al.
Published: (2026)

CVGL: Causal Learning and Geometric Topology
by: Ouyang, Songsong, et al.
Published: (2026)

Show Me What and Where has Changed? Question Answering and Grounding for Remote Sensing Change Detection
by: Li, Ke, et al.
Published: (2024)

LookWhere? Efficient Visual Recognition by Learning Where to Look and What to See from Self-Supervision
by: Fuller, Anthony, et al.
Published: (2025)

KitchenTwin: Semantically and Geometrically Grounded 3D Kitchen Digital Twins
by: Wu, Quanyun, et al.
Published: (2026)

Taming Video Diffusion Prior with Scene-Grounding Guidance for 3D Gaussian Splatting from Sparse Inputs
by: Zhong, Yingji, et al.
Published: (2025)

TRACE: Temporal Grounding Video LLM via Causal Event Modeling
by: Guo, Yongxin, et al.
Published: (2024)

Context Consistency Learning via Sentence Removal for Semi-Supervised Video Paragraph Grounding
by: Zhong, Yaokun, et al.
Published: (2025)

DecoFuse: Decomposing and Fusing the "What", "Where", and "How" for Brain-Inspired fMRI-to-Video Decoding
by: Li, Chong, et al.
Published: (2025)

From Priors to Perception: Grounding Video-LLMs in Physical Reality
by: Zhao, Zicheng, et al.
Published: (2026)

VideoAfford: Grounding 3D Affordance from Human-Object-Interaction Videos via Multimodal Large Language Model
by: Wang, Hanqing, et al.
Published: (2026)

CMSA-Net: Causal Multi-scale Aggregation with Adaptive Multi-source Reference for Video Polyp Segmentation
by: Wang, Tong, et al.
Published: (2026)

When and What: Diffusion-Grounded VideoLLM with Entity Aware Segmentation for Long Video Understanding
by: Fang, Pengcheng, et al.
Published: (2025)

VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding
by: Guo, Yongxin, et al.
Published: (2024)

Measuring 3D Spatial Geometric Consistency in Dynamic Generated Videos
by: Dou, Weijia, et al.
Published: (2026)

Enrich and Detect: Video Temporal Grounding with Multimodal LLMs
by: Pramanick, Shraman, et al.
Published: (2025)

Grab-3D: Detecting AI-Generated Videos from 3D Geometric Temporal Consistency
by: Chen, Wenhan, et al.
Published: (2025)

AirZoo: A Unified Large-Scale Dataset for Grounding Aerial Geometric 3D Vision
by: Cheng, Xiaoya, et al.
Published: (2026)

Part-Aware Open-Vocabulary 3D Affordance Grounding via Prototypical Semantic and Geometric Alignment
by: Gou, Dongqiang, et al.
Published: (2026)

Fine-grained Spatiotemporal Grounding on Egocentric Videos
by: Liang, Shuo, et al.
Published: (2025)

VAGNet: Grounding 3D Affordance from Human-Object Interactions in Videos
by: Mao, Aihua, et al.
Published: (2026)

Unleashing the Potential of Multimodal LLMs for Zero-Shot Spatio-Temporal Video Grounding
by: Yang, Zaiquan, et al.
Published: (2025)

Decoupling What to Count and Where to See for Referring Expression Counting
by: Zou, Yuda, et al.
Published: (2025)

Where, What, Why: Towards Explainable Driver Attention Prediction
by: Zhou, Yuchen, et al.
Published: (2025)

Cross-modal Causal Relation Alignment for Video Question Grounding
by: Chen, Weixing, et al.
Published: (2025)

GGPT: Geometry Grounded Point Transformer
by: Chen, Yutong, et al.
Published: (2026)

Referencing Where to Focus: Improving Visual Grounding with Referential Query
by: Wang, Yabing, et al.
Published: (2024)

Learning Adaptive and Temporally Causal Video Tokenization in a 1D Latent Space
by: Li, Yan, et al.
Published: (2025)

CausalGS: Learning Physical Causality of 3D Dynamic Scenes with Gaussian Representations
by: Lu, Nengbo, et al.
Published: (2026)

Learning Where to Look: Self-supervised Viewpoint Selection for Active Localization using Geometrical Information
by: Di Giammarino, Luca, et al.
Published: (2024)

What Is The Best 3D Scene Representation for Robotics? From Geometric to Foundation Models
by: Deng, Tianchen, et al.
Published: (2025)

What-Meets-Where: Unified Learning of Action and Contact Localization in Images
by: Wang, Yuxiao, et al.
Published: (2025)

CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering
by: Zhang, Mingfang, et al.
Published: (2026)

HawkEye: Training Video-Text LLMs for Grounding Text in Videos
by: Wang, Yueqian, et al.
Published: (2024)

Geometric Transformation-Embedded Mamba for Learned Video Compression
by: Wei, Hao, et al.
Published: (2026)

How Should Video LLMs Output Time? An Analysis of Efficient Temporal Grounding Paradigms
by: Jin, Shengji, et al.
Published: (2026)

Within the Dynamic Context: Inertia-aware 3D Human Modeling with Pose Sequence
by: Chen, Yutong, et al.
Published: (2024)