| Main Authors: | Ganesh, Aashutosh, Popa, Mirela, Odijk, Daan, Tintarev, Nava |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2410.03323 |
Similar Items
Find the Cliffhanger: Multi-Modal Trailerness in Soap Operas
by: Bretti, Carlo, et al.
Published: (2024)
A Little More Like This: Text-to-Image Retrieval with Vision-Language Models Using Relevance Feedback
by: Khaertdinov, Bulat, et al.
Published: (2025)
Spatio-Temporal Data Enhanced Vision-Language Model for Traffic Scene Understanding
by: Ma, Jingtian, et al.
Published: (2025)
VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification
by: Meng, Jiahao, et al.
Published: (2026)
AsyReC: A Multimodal Graph-based Framework for Spatio-Temporal Asymmetric Dyadic Relationship Classification
by: Tang, Wang, et al.
Published: (2025)
Decoupling Spatio-Temporal Adapter for Fine-Grained Badminton Action Localization
by: Wang, Tianyu, et al.
Published: (2026)
Open-o3-Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence
by: Meng, Jiahao, et al.
Published: (2025)
Consistent and Invariant Generalization Learning for Short-video Misinformation Detection
by: Guo, Hanghui, et al.
Published: (2025)
Reviewing Intelligent Cinematography: AI research for camera-based video production
by: Azzarelli, Adrian, et al.
Published: (2024)
Leveraging multimodal explanatory annotations for video interpretation with Modality Specific Dataset
by: Ancarani, Elisa, et al.
Published: (2025)
Subjective evaluation of UHD video coded using VVC with LCEVC and ML-VVC
by: Ramzan, Naeem, et al.
Published: (2026)
EALD-MLLM: Emotion Analysis in Long-sequential and De-identity videos with Multi-modal Large Language Model
by: Li, Deng, et al.
Published: (2024)
Spatial-Temporal Human-Object Interaction Detection
by: Sun, Xu, et al.
Published: (2025)
Boosting Temporal Sentence Grounding via Causal Inference
by: Tang, Kefan, et al.
Published: (2025)
Enrich and Detect: Video Temporal Grounding with Multimodal LLMs
by: Pramanick, Shraman, et al.
Published: (2025)
Where Does Vision Meet Language? Understanding and Refining Visual Fusion in MLLMs via Contrastive Attention
by: Song, Shezheng, et al.
Published: (2026)
Audio-visual training for improved grounding in video-text LLMs
by: Sagare, Shivprasad, et al.
Published: (2024)
Transformer-based Video Saliency Prediction with High Temporal Dimension Decoding
by: Moradi, Morteza, et al.
Published: (2024)
SSNVC: Single Stream Neural Video Compression with Implicit Temporal Information
by: Wang, Feng, et al.
Published: (2024)
Towards Universal Modal Tracking with Online Dense Temporal Token Learning
by: Zheng, Yaozong, et al.
Published: (2025)
Probabilistic Temporal Masked Attention for Cross-view Online Action Detection
by: Xie, Liping, et al.
Published: (2025)
Gear-NeRF: Free-Viewpoint Rendering and Tracking with Motion-aware Spatio-Temporal Sampling
by: Liu, Xinhang, et al.
Published: (2024)
Text-controlled Motion Mamba: Text-Instructed Temporal Grounding of Human Motion
by: Wang, Xinghan, et al.
Published: (2024)
Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution
by: Chen, Zhikai, et al.
Published: (2024)
SpikeMba: Multi-Modal Spiking Saliency Mamba for Temporal Video Grounding
by: Li, Wenrui, et al.
Published: (2024)
Exposure Completing for Temporally Consistent Neural High Dynamic Range Video Rendering
by: Cui, Jiahao, et al.
Published: (2024)
Pursuing Temporal-Consistent Video Virtual Try-On via Dynamic Pose Interaction
by: Li, Dong, et al.
Published: (2025)
TUNA: Comprehensive Fine-grained Temporal Understanding Evaluation on Dense Dynamic Videos
by: Kong, Fanheng, et al.
Published: (2025)
TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models
by: Zhang, Zhongwei, et al.
Published: (2024)
VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models
by: Lan, Xiaohan, et al.
Published: (2024)
Decompose and Transfer: CoT-Prompting Enhanced Alignment for Open-Vocabulary Temporal Action Detection
by: Zhu, Sa, et al.
Published: (2026)
ASAP: Advancing Semantic Alignment Promotes Multi-Modal Manipulation Detecting and Grounding
by: Zhang, Zhenxing, et al.
Published: (2024)
ChatVTG: Video Temporal Grounding via Chat with Video Dialogue Large Language Models
by: Qu, Mengxue, et al.
Published: (2024)
Context-aware TFL: A Universal Context-aware Contrastive Learning Framework for Temporal Forgery Localization
by: Yin, Qilin, et al.
Published: (2025)
Efficient Bitrate Ladder Construction using Transfer Learning and Spatio-Temporal Features
by: Falahati, Ali, et al.
Published: (2024)
TEn-CATG: Text-Enriched Audio-Visual Video Parsing with Multi-Scale Category-Aware Temporal Graph
by: Chen, Yaru, et al.
Published: (2025)
Learning Generalizable and Efficient Image Watermarking via Hierarchical Two-Stage Optimization
by: Liu, Ke, et al.
Published: (2025)
AKiRa: Augmentation Kit on Rays for optical video generation
by: Wang, Xi, et al.
Published: (2024)
TMFNet: Two-Stream Multi-Channels Fusion Networks for Color Image Operation Chain Detection
by: Niu, Yakun, et al.
Published: (2024)
4D Gaussian Splatting with Scale-aware Residual Field and Adaptive Optimization for Real-time Rendering of Temporally Complex Dynamic Scenes
by: Yan, Jinbo, et al.
Published: (2024)