| Main Authors: | So, Yerim; Kim, Jiyeong; Yoon, Jiwon; Min, Dongbo |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2605.23288 |
Similar Items
Enhancing Alignment for Unified Multimodal Models via Semantically-Grounded Supervision
by: Kim, Jiyeong, et al.
Published: (2026)
Open-Vocabulary Spatio-Temporal Action Detection
by: Wu, Tao, et al.
Published: (2024)
Boundary-Recovering Network for Temporal Action Detection
by: Kim, Jihwan, et al.
Published: (2024)
Emerging Property of Masked Token for Effective Pre-training
by: Choi, Hyesong, et al.
Published: (2024)
RAZER: Robust Accelerated Zero-Shot 3D Open-Vocabulary Panoptic Reconstruction with Spatio-Temporal Aggregation
by: Patel, Naman, et al.
Published: (2025)
DENOISER: Rethinking the Robustness for Open-Vocabulary Action Recognition
by: Cheng, Haozhe, et al.
Published: (2024)
Enhancing Spatio-Temporal Zero-shot Action Recognition with Language-driven Description Attributes
by: Kim, Yehna, et al.
Published: (2025)
Modelling Spatio-Temporal Interactions For Compositional Action Recognition
by: Rajendiran, Ramanathan, et al.
Published: (2023)
Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding
by: Wasim, Syed Talal, et al.
Published: (2023)
Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools
by: Yuan, Zhenlong, et al.
Published: (2025)
Learning to Generalize without Bias for Open-Vocabulary Action Recognition
by: Yu, Yating, et al.
Published: (2025)
Open-Vocabulary Temporal Action Localization using Multimodal Guidance
by: Gupta, Akshita, et al.
Published: (2024)
Exploring Scalability of Self-Training for Open-Vocabulary Temporal Action Localization
by: Hyun, Jeongseok, et al.
Published: (2024)
One-Stage Open-Vocabulary Temporal Action Detection Leveraging Temporal Multi-scale and Action Label Features
by: Nguyen, Trung Thanh, et al.
Published: (2024)
CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation
by: Cho, Seokju, et al.
Published: (2023)
MVAFormer: RGB-based Multi-View Spatio-Temporal Action Recognition with Transformer
by: Yamane, Taiga, et al.
Published: (2025)
Spatio-Temporal Joint Density Driven Learning for Skeleton-Based Action Recognition
by: Gunasekara, Shanaka Ramesh, et al.
Published: (2025)
Exploiting VLM Localizability and Semantics for Open Vocabulary Action Detection
by: Bao, Wentao, et al.
Published: (2024)
Scaling Open-Vocabulary Action Detection
by: Sia, Zhen Hao, et al.
Published: (2025)
FluoCLIP: Stain-Aware Focus Quality Assessment in Fluorescence Microscopy
by: Park, Hyejin, et al.
Published: (2026)
Rethinking CLIP-based Video Learners in Cross-Domain Open-Vocabulary Action Recognition
by: Lin, Kun-Yu, et al.
Published: (2024)
Leveraging Temporal Contextualization for Video Action Recognition
by: Kim, Minji, et al.
Published: (2024)
Stepping Out of Similar Semantic Space for Open-Vocabulary Segmentation
by: Liu, Yong, et al.
Published: (2025)
Denoise and Align: Diffusion-Driven Foreground Knowledge Prompting for Open-Vocabulary Temporal Action Detection
by: Zhu, Sa, et al.
Published: (2026)
MGCA-Net: Multi-Grained Category-Aware Network for Open-Vocabulary Temporal Action Localization
by: Fang, Zhenying, et al.
Published: (2025)
FROSTER: Frozen CLIP Is A Strong Teacher for Open-Vocabulary Action Recognition
by: Huang, Xiaohu, et al.
Published: (2024)
Decompose and Transfer: CoT-Prompting Enhanced Alignment for Open-Vocabulary Temporal Action Detection
by: Zhu, Sa, et al.
Published: (2026)
StretchySnake: Flexible SSM Training Unlocks Action Recognition Across Spatio-Temporal Scales
by: Siddiqui, Nyle, et al.
Published: (2025)
UniSTFormer: Unified Spatio-Temporal Lightweight Transformer for Efficient Skeleton-Based Action Recognition
by: Wu, Wenhan, et al.
Published: (2025)
Object-Centric Open-Vocabulary Image-Retrieval with Aggregated Features
by: Levi, Hila, et al.
Published: (2023)
HERO: Hierarchical Embedding-Refinement for Open-Vocabulary Temporal Sentence Grounding in Videos
by: Han, Tingting, et al.
Published: (2026)
Spatio-Temporal Context Prompting for Zero-Shot Action Detection
by: Huang, Wei-Jhe, et al.
Published: (2024)
A Decoding Scheme with Successive Aggregation of Multi-Level Features for Light-Weight Semantic Segmentation
by: Yoo, Jiwon, et al.
Published: (2024)
D$^2$ST-Adapter: Disentangled-and-Deformable Spatio-Temporal Adapter for Few-shot Action Recognition
by: Pei, Wenjie, et al.
Published: (2023)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition
by: Ullah, Hayat, et al.
Published: (2025)
Dynamic Guidance Adversarial Distillation with Enhanced Teacher Knowledge
by: Park, Hyejin, et al.
Published: (2024)
Spatio-Temporal Proximity-Aware Dual-Path Model for Panoramic Activity Recognition
by: Lee, Sumin, et al.
Published: (2024)
OVMR: Open-Vocabulary Recognition with Multi-Modal References
by: Ma, Zehong, et al.
Published: (2024)
SkateFormer: Skeletal-Temporal Transformer for Human Action Recognition
by: Do, Jeonghyeok, et al.
Published: (2024)
Training-free Boost for Open-Vocabulary Object Detection with Confidence Aggregation
by: Zheng, Yanhao, et al.
Published: (2024)