| Main Authors: | Tu, Xuezhen, Wu, Jingyu, Kang, Fangyu, Nong, Qingpeng, Zhang, Kaijin, Niu, Chaoyue, Wu, Fan |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2604.08014 |
Similar Items
Context-Guided Spatio-Temporal Video Grounding
by: Gu, Xin, et al.
Published: (2024)
Towards Long-Form Spatio-Temporal Video Grounding
by: Gu, Xin, et al.
Published: (2026)
Thinking With Bounding Boxes: Enhancing Spatio-Temporal Video Grounding via Reinforcement Fine-Tuning
by: Gu, Xin, et al.
Published: (2025)
IntentVCNet: Bridging Spatio-Temporal Gaps for Intention-Oriented Controllable Video Captioning
by: Qiu, Tianheng, et al.
Published: (2025)
OmniSTVG: Toward Spatio-Temporal Omni-Object Video Grounding
by: Yao, Jiali, et al.
Published: (2025)
Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment
by: Fei, Hao, et al.
Published: (2024)
OmniGround: A Comprehensive Spatio-Temporal Grounding Benchmark for Real-World Complex Scenarios
by: Gao, Hong, et al.
Published: (2025)
SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability
by: Wang, Jiankang, et al.
Published: (2025)
UBATrack: Spatio-Temporal State Space Model for General Multi-Modal Tracking
by: Liang, Qihua, et al.
Published: (2026)
Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding
by: Wasim, Syed Talal, et al.
Published: (2023)
Video-Language Alignment via Spatio-Temporal Graph Transformer
by: Zhang, Shi-Xue, et al.
Published: (2024)
VideoMolmo: Spatio-Temporal Grounding Meets Pointing
by: Ahmad, Ghazi Shazan, et al.
Published: (2025)
SpatioTemporal Learning for Human Pose Estimation in Sparsely-Labeled Videos
by: Jiao, Yingying, et al.
Published: (2025)
SegDebias: Test-Time Bias Mitigation for ViT-Based CLIP via Segmentation
by: Wu, Fangyu, et al.
Published: (2025)
STDR: Spatio-Temporal Decoupling for Real-Time Dynamic Scene Rendering
by: Li, Zehao, et al.
Published: (2025)
CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering
by: Zhang, Mingfang, et al.
Published: (2026)
Rethinking the Spatio-Temporal Alignment of End-to-End 3D Perception
by: Li, Xiaoyu, et al.
Published: (2025)
Detector-Empowered Video Large Language Model for Efficient Spatio-Temporal Grounding
by: Gao, Shida, et al.
Published: (2025)
Decoupling Spatio-Temporal Adapter for Fine-Grained Badminton Action Localization
by: Wang, Tianyu, et al.
Published: (2026)
VideoMamba: Spatio-Temporal Selective State Space Model
by: Park, Jinyoung, et al.
Published: (2024)
Static and Dynamic Graph Alignment Network for Temporal Video Grounding
by: Hu, Zhanjie, et al.
Published: (2026)
Infinite-ID: Identity-preserved Personalization via ID-semantics Decoupling Paradigm
by: Wu, Yi, et al.
Published: (2024)
APVR: Hour-Level Long Video Understanding with Adaptive Pivot Visual Information Retrieval
by: Gao, Hong, et al.
Published: (2025)
Multimodal Spatio-temporal Graph Learning for Alignment-free RGBT Video Object Detection
by: Wang, Qishun, et al.
Published: (2025)
Knowing Your Target: Target-Aware Transformer Makes Better Spatio-Temporal Video Grounding
by: Gu, Xin, et al.
Published: (2025)
DeRA: Decoupled Representation Alignment for Video Tokenization
by: Guo, Pengbo, et al.
Published: (2025)
NCSTR: Node-Centric Decoupled Spatio-Temporal Reasoning for Video-based Human Pose Estimation
by: Huynh, Quang Dang, et al.
Published: (2026)
Unleashing the Potential of Multimodal LLMs for Zero-Shot Spatio-Temporal Video Grounding
by: Yang, Zaiquan, et al.
Published: (2025)
Contextual Self-paced Learning for Weakly Supervised Spatio-Temporal Video Grounding
by: Kumar, Akash, et al.
Published: (2025)
Know-Show: Benchmarking Video-Language Models on Spatio-Temporal Grounded Reasoning
by: Sugandhika, Chinthani, et al.
Published: (2025)
MuseTalk: Real-Time High-Fidelity Video Dubbing via Spatio-Temporal Sampling
by: Zhang, Yue, et al.
Published: (2024)
SpatioTemporal Difference Network for Video Depth Super-Resolution
by: Wang, Zhengxue, et al.
Published: (2025)
Enhanced Textual Feature Extraction for Visual Question Answering: A Simple Convolutional Approach
by: Zhang, Zhilin, et al.
Published: (2024)
STAF: 3D Human Mesh Recovery from Video with Spatio-Temporal Alignment Fusion
by: Yao, Wei, et al.
Published: (2024)
TRACE: Temporal Grounding Video LLM via Causal Event Modeling
by: Guo, Yongxin, et al.
Published: (2024)
Training-Free Spatio-temporal Decoupled Reasoning Video Segmentation with Adaptive Object Memory
by: Zhu, Zhengtong, et al.
Published: (2026)
Feature Alignment Determines Fusion Strategy: A Comparative Study of Cross-Attention and Concatenation in Multimodal Learning
by: Zhou, Zhiqiang, et al.
Published: (2026)
Adaptive Routing of Text-to-Image Generation Requests Between Large Cloud Model and Light-Weight Edge Model
by: Xin, Zewei, et al.
Published: (2024)
Decoupled Spatio-Temporal Consistency Learning for Self-Supervised Tracking
by: Zheng, Yaozong, et al.
Published: (2025)
ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos
by: Xu, Qi'ao, et al.
Published: (2025)