| Main Author: | Zhong, Yutong |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2510.17034 |
Similar Items
Learning Pyramid-structured Long-range Dependencies for 3D Human Pose Estimation
by: Wei, Mingjie, et al.
Published: (2025)
Where, What, Why: Toward Explainable 3D-GS Watermarking
by: Cai, Mingshu, et al.
Published: (2026)
From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs
by: Zhang, Le, et al.
Published: (2026)
LookWhen? Fast Video Recognition by Learning When, Where, and What to Compute
by: Salamatian, Ali, et al.
Published: (2026)
CVGL: Causal Learning and Geometric Topology
by: Ouyang, Songsong, et al.
Published: (2026)
Show Me What and Where has Changed? Question Answering and Grounding for Remote Sensing Change Detection
by: Li, Ke, et al.
Published: (2024)
LookWhere? Efficient Visual Recognition by Learning Where to Look and What to See from Self-Supervision
by: Fuller, Anthony, et al.
Published: (2025)
KitchenTwin: Semantically and Geometrically Grounded 3D Kitchen Digital Twins
by: Wu, Quanyun, et al.
Published: (2026)
Taming Video Diffusion Prior with Scene-Grounding Guidance for 3D Gaussian Splatting from Sparse Inputs
by: Zhong, Yingji, et al.
Published: (2025)
TRACE: Temporal Grounding Video LLM via Causal Event Modeling
by: Guo, Yongxin, et al.
Published: (2024)
Context Consistency Learning via Sentence Removal for Semi-Supervised Video Paragraph Grounding
by: Zhong, Yaokun, et al.
Published: (2025)
DecoFuse: Decomposing and Fusing the "What", "Where", and "How" for Brain-Inspired fMRI-to-Video Decoding
by: Li, Chong, et al.
Published: (2025)
From Priors to Perception: Grounding Video-LLMs in Physical Reality
by: Zhao, Zicheng, et al.
Published: (2026)
VideoAfford: Grounding 3D Affordance from Human-Object-Interaction Videos via Multimodal Large Language Model
by: Wang, Hanqing, et al.
Published: (2026)
CMSA-Net: Causal Multi-scale Aggregation with Adaptive Multi-source Reference for Video Polyp Segmentation
by: Wang, Tong, et al.
Published: (2026)
When and What: Diffusion-Grounded VideoLLM with Entity Aware Segmentation for Long Video Understanding
by: Fang, Pengcheng, et al.
Published: (2025)
VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding
by: Guo, Yongxin, et al.
Published: (2024)
Measuring 3D Spatial Geometric Consistency in Dynamic Generated Videos
by: Dou, Weijia, et al.
Published: (2026)
Enrich and Detect: Video Temporal Grounding with Multimodal LLMs
by: Pramanick, Shraman, et al.
Published: (2025)
Grab-3D: Detecting AI-Generated Videos from 3D Geometric Temporal Consistency
by: Chen, Wenhan, et al.
Published: (2025)
AirZoo: A Unified Large-Scale Dataset for Grounding Aerial Geometric 3D Vision
by: Cheng, Xiaoya, et al.
Published: (2026)
Part-Aware Open-Vocabulary 3D Affordance Grounding via Prototypical Semantic and Geometric Alignment
by: Gou, Dongqiang, et al.
Published: (2026)
Fine-grained Spatiotemporal Grounding on Egocentric Videos
by: Liang, Shuo, et al.
Published: (2025)
VAGNet: Grounding 3D Affordance from Human-Object Interactions in Videos
by: Mao, Aihua, et al.
Published: (2026)
Unleashing the Potential of Multimodal LLMs for Zero-Shot Spatio-Temporal Video Grounding
by: Yang, Zaiquan, et al.
Published: (2025)
Decoupling What to Count and Where to See for Referring Expression Counting
by: Zou, Yuda, et al.
Published: (2025)
Where, What, Why: Towards Explainable Driver Attention Prediction
by: Zhou, Yuchen, et al.
Published: (2025)
Cross-modal Causal Relation Alignment for Video Question Grounding
by: Chen, Weixing, et al.
Published: (2025)
GGPT: Geometry Grounded Point Transformer
by: Chen, Yutong, et al.
Published: (2026)
Referencing Where to Focus: Improving Visual Grounding with Referential Query
by: Wang, Yabing, et al.
Published: (2024)
Learning Adaptive and Temporally Causal Video Tokenization in a 1D Latent Space
by: Li, Yan, et al.
Published: (2025)
CausalGS: Learning Physical Causality of 3D Dynamic Scenes with Gaussian Representations
by: Lu, Nengbo, et al.
Published: (2026)
Learning Where to Look: Self-supervised Viewpoint Selection for Active Localization using Geometrical Information
by: Di Giammarino, Luca, et al.
Published: (2024)
What Is The Best 3D Scene Representation for Robotics? From Geometric to Foundation Models
by: Deng, Tianchen, et al.
Published: (2025)
What-Meets-Where: Unified Learning of Action and Contact Localization in Images
by: Wang, Yuxiao, et al.
Published: (2025)
CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering
by: Zhang, Mingfang, et al.
Published: (2026)
HawkEye: Training Video-Text LLMs for Grounding Text in Videos
by: Wang, Yueqian, et al.
Published: (2024)
Geometric Transformation-Embedded Mamba for Learned Video Compression
by: Wei, Hao, et al.
Published: (2026)
How Should Video LLMs Output Time? An Analysis of Efficient Temporal Grounding Paradigms
by: Jin, Shengji, et al.
Published: (2026)
Within the Dynamic Context: Inertia-aware 3D Human Modeling with Pose Sequence
by: Chen, Yutong, et al.
Published: (2024)