:: Library Catalog

Bibliographic Details
Main Authors:	Qiu, Tianheng; Gao, Jingchun; Li, Jingyu; Leong, Huiyi; Huang, Xuan; Wang, Xi; Zhang, Xiaocheng; Xu, Kele; Zhang, Lan
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2507.18531

Similar Items

Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding
by: Tu, Xuezhen, et al.
Published: (2026)

The Geometry of Cortical Computation: Manifold Disentanglement and Predictive Dynamics in VCNet
by: Hill, Brennen A., et al.
Published: (2025)

VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG
by: Fu, Honghao, et al.
Published: (2026)

MIAT: Maneuver-Intention-Aware Transformer for Spatio-Temporal Trajectory Prediction
by: Raskoti, Chandra, et al.
Published: (2025)

CaptionFormer: Unified Segmentation, Tracking, and Captioning for Spatio-Temporal Objects
by: Fiastre, Gabriel, et al.
Published: (2025)

ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos
by: Xu, Qi'ao, et al.
Published: (2025)

NarrativeBridge: Enhancing Video Captioning with Causal-Temporal Narrative
by: Nadeem, Asmar, et al.
Published: (2024)

Scene Graph-guided SegCaptioning Transformer with Fine-grained Alignment for Controllable Video Segmentation and Captioning
by: Zhang, Xu, et al.
Published: (2026)

Interpretable Traffic Responsibility from Dashcam Video via Legal Multi Agent Reasoning
by: Yang, Jingchun, et al.
Published: (2026)

SOVC: Subject-Oriented Video Captioning
by: Teng, Chang, et al.
Published: (2023)

DVFace: Spatio-Temporal Dual-Prior Diffusion for Video Face Restoration
by: Chen, Zheng, et al.
Published: (2026)

Dual-path Collaborative Generation Network for Emotional Video Captioning
by: Ye, Cheng, et al.
Published: (2024)

OmniGround: A Comprehensive Spatio-Temporal Grounding Benchmark for Real-World Complex Scenarios
by: Gao, Hong, et al.
Published: (2025)

Neural-Symbolic VideoQA: Learning Compositional Spatio-Temporal Reasoning for Real-world Video Question Answering
by: Liang, Lili, et al.
Published: (2024)

PASTS: Progress-Aware Spatio-Temporal Transformer Speaker For Vision-and-Language Navigation
by: Wang, Liuyi, et al.
Published: (2023)

Bridging the Intent Gap: Knowledge-Enhanced Visual Generation
by: Cheng, Yi, et al.
Published: (2024)

HAD: Hierarchical Asymmetric Distillation to Bridge Spatio-Temporal Gaps in Event-Based Object Tracking
by: Deng, Yao, et al.
Published: (2025)

FiLA-Video: Spatio-Temporal Compression for Fine-Grained Long Video Understanding
by: Guo, Yanan, et al.
Published: (2025)

Context-Guided Spatio-Temporal Video Grounding
by: Gu, Xin, et al.
Published: (2024)

IntentionNav: A Benchmark for Intent-Driven Object Navigation from Implicit Human Instruction
by: Qian, Lin, et al.
Published: (2026)

Towards Long-Form Spatio-Temporal Video Grounding
by: Gu, Xin, et al.
Published: (2026)

Spatio-Temporal Distortion Aware Omnidirectional Video Super-Resolution
by: An, Hongyu, et al.
Published: (2024)

Consistent multiple-relaxation-time lattice Boltzmann method for the volume averaged Navier-Stokes equations
by: Liu, Yang, et al.
Published: (2024)

CORE: Concept-Oriented Reinforcement for Bridging the Definition-Application Gap in Mathematical Reasoning
by: Gao, Zijun, et al.
Published: (2025)

Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment
by: Fei, Hao, et al.
Published: (2024)

BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues
by: Sarto, Sara, et al.
Published: (2024)

Multi-Modality Spatio-Temporal Forecasting via Self-Supervised Learning
by: Deng, Jiewen, et al.
Published: (2024)

Video-Language Alignment via Spatio-Temporal Graph Transformer
by: Zhang, Shi-Xue, et al.
Published: (2024)

Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation
by: Wu, Shengqiong, et al.
Published: (2025)

Bridging the Gap between User Intent and LLM: A Requirement Alignment Approach for Code Generation
by: Li, Jia, et al.
Published: (2026)

Multi-Modal LLM based Image Captioning in ICT: Bridging the Gap Between General and Industry Domain
by: Chao, Lianying, et al.
Published: (2026)

Video-STR: Reinforcing MLLMs in Video Spatio-Temporal Reasoning with Relation Graph
by: Wang, Wentao, et al.
Published: (2025)

Efficient Multi-scale Network with Learnable Discrete Wavelet Transform for Blind Motion Deblurring
by: Gao, Xin, et al.
Published: (2023)

Patch Spatio-Temporal Relation Prediction for Video Anomaly Detection
by: Shen, Hao, et al.
Published: (2024)

Bridging the Visual Gap: Fine-Tuning Multimodal Models with Knowledge-Adapted Captions
by: Yanuka, Moran, et al.
Published: (2024)

Rewrite Caption Semantics: Bridging Semantic Gaps for Language-Supervised Semantic Segmentation
by: Xing, Yun, et al.
Published: (2023)

Spatio-Temporal Token Pruning for Efficient High-Resolution GUI Agents
by: Xu, Zhou, et al.
Published: (2026)

LINR Bridge: Vector Graphic Animation via Neural Implicits and Video Diffusion Priors
by: Gao, Wenshuo, et al.
Published: (2025)

Compact Attention: Exploiting Structured Spatio-Temporal Sparsity for Fast Video Generation
by: Li, Qirui, et al.
Published: (2025)

OmniSTVG: Toward Spatio-Temporal Omni-Object Video Grounding
by: Yao, Jiali, et al.
Published: (2025)