:: Library Catalog

Bibliographic Details
Main Authors:	Tu, Xuezhen, Wu, Jingyu, Kang, Fangyu, Nong, Qingpeng, Zhang, Kaijin, Niu, Chaoyue, Wu, Fan
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2604.08014

Similar Items

Context-Guided Spatio-Temporal Video Grounding
by: Gu, Xin, et al.
Published: (2024)

Towards Long-Form Spatio-Temporal Video Grounding
by: Gu, Xin, et al.
Published: (2026)

Thinking With Bounding Boxes: Enhancing Spatio-Temporal Video Grounding via Reinforcement Fine-Tuning
by: Gu, Xin, et al.
Published: (2025)

IntentVCNet: Bridging Spatio-Temporal Gaps for Intention-Oriented Controllable Video Captioning
by: Qiu, Tianheng, et al.
Published: (2025)

OmniSTVG: Toward Spatio-Temporal Omni-Object Video Grounding
by: Yao, Jiali, et al.
Published: (2025)

Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment
by: Fei, Hao, et al.
Published: (2024)

OmniGround: A Comprehensive Spatio-Temporal Grounding Benchmark for Real-World Complex Scenarios
by: Gao, Hong, et al.
Published: (2025)

SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability
by: Wang, Jiankang, et al.
Published: (2025)

UBATrack: Spatio-Temporal State Space Model for General Multi-Modal Tracking
by: Liang, Qihua, et al.
Published: (2026)

Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding
by: Wasim, Syed Talal, et al.
Published: (2023)

Video-Language Alignment via Spatio-Temporal Graph Transformer
by: Zhang, Shi-Xue, et al.
Published: (2024)

VideoMolmo: Spatio-Temporal Grounding Meets Pointing
by: Ahmad, Ghazi Shazan, et al.
Published: (2025)

SpatioTemporal Learning for Human Pose Estimation in Sparsely-Labeled Videos
by: Jiao, Yingying, et al.
Published: (2025)

SegDebias: Test-Time Bias Mitigation for ViT-Based CLIP via Segmentation
by: Wu, Fangyu, et al.
Published: (2025)

STDR: Spatio-Temporal Decoupling for Real-Time Dynamic Scene Rendering
by: Li, Zehao, et al.
Published: (2025)

CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering
by: Zhang, Mingfang, et al.
Published: (2026)

Rethinking the Spatio-Temporal Alignment of End-to-End 3D Perception
by: Li, Xiaoyu, et al.
Published: (2025)

Detector-Empowered Video Large Language Model for Efficient Spatio-Temporal Grounding
by: Gao, Shida, et al.
Published: (2025)

Decoupling Spatio-Temporal Adapter for Fine-Grained Badminton Action Localization
by: Wang, Tianyu, et al.
Published: (2026)

VideoMamba: Spatio-Temporal Selective State Space Model
by: Park, Jinyoung, et al.
Published: (2024)

Static and Dynamic Graph Alignment Network for Temporal Video Grounding
by: Hu, Zhanjie, et al.
Published: (2026)

Infinite-ID: Identity-preserved Personalization via ID-semantics Decoupling Paradigm
by: Wu, Yi, et al.
Published: (2024)

APVR: Hour-Level Long Video Understanding with Adaptive Pivot Visual Information Retrieval
by: Gao, Hong, et al.
Published: (2025)

Multimodal Spatio-temporal Graph Learning for Alignment-free RGBT Video Object Detection
by: Wang, Qishun, et al.
Published: (2025)

Knowing Your Target: Target-Aware Transformer Makes Better Spatio-Temporal Video Grounding
by: Gu, Xin, et al.
Published: (2025)

DeRA: Decoupled Representation Alignment for Video Tokenization
by: Guo, Pengbo, et al.
Published: (2025)

NCSTR: Node-Centric Decoupled Spatio-Temporal Reasoning for Video-based Human Pose Estimation
by: Huynh, Quang Dang, et al.
Published: (2026)

Unleashing the Potential of Multimodal LLMs for Zero-Shot Spatio-Temporal Video Grounding
by: Yang, Zaiquan, et al.
Published: (2025)

Contextual Self-paced Learning for Weakly Supervised Spatio-Temporal Video Grounding
by: Kumar, Akash, et al.
Published: (2025)

Know-Show: Benchmarking Video-Language Models on Spatio-Temporal Grounded Reasoning
by: Sugandhika, Chinthani, et al.
Published: (2025)

MuseTalk: Real-Time High-Fidelity Video Dubbing via Spatio-Temporal Sampling
by: Zhang, Yue, et al.
Published: (2024)

SpatioTemporal Difference Network for Video Depth Super-Resolution
by: Wang, Zhengxue, et al.
Published: (2025)

Enhanced Textual Feature Extraction for Visual Question Answering: A Simple Convolutional Approach
by: Zhang, Zhilin, et al.
Published: (2024)

STAF: 3D Human Mesh Recovery from Video with Spatio-Temporal Alignment Fusion
by: Yao, Wei, et al.
Published: (2024)

TRACE: Temporal Grounding Video LLM via Causal Event Modeling
by: Guo, Yongxin, et al.
Published: (2024)

Training-Free Spatio-temporal Decoupled Reasoning Video Segmentation with Adaptive Object Memory
by: Zhu, Zhengtong, et al.
Published: (2026)

Feature Alignment Determines Fusion Strategy: A Comparative Study of Cross-Attention and Concatenation in Multimodal Learning
by: Zhou, Zhiqiang, et al.
Published: (2026)

Adaptive Routing of Text-to-Image Generation Requests Between Large Cloud Model and Light-Weight Edge Model
by: Xin, Zewei, et al.
Published: (2024)

Decoupled Spatio-Temporal Consistency Learning for Self-Supervised Tracking
by: Zheng, Yaozong, et al.
Published: (2025)

ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos
by: Xu, Qi'ao, et al.
Published: (2025)