:: Library Catalog

Bibliographic Details
Main Authors:	So, Yerim; Kim, Jiyeong; Yoon, Jiwon; Min, Dongbo
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2605.23288

Similar Items

Enhancing Alignment for Unified Multimodal Models via Semantically-Grounded Supervision
by: Kim, Jiyeong, et al.
Published: (2026)

Open-Vocabulary Spatio-Temporal Action Detection
by: Wu, Tao, et al.
Published: (2024)

Boundary-Recovering Network for Temporal Action Detection
by: Kim, Jihwan, et al.
Published: (2024)

Emerging Property of Masked Token for Effective Pre-training
by: Choi, Hyesong, et al.
Published: (2024)

RAZER: Robust Accelerated Zero-Shot 3D Open-Vocabulary Panoptic Reconstruction with Spatio-Temporal Aggregation
by: Patel, Naman, et al.
Published: (2025)

DENOISER: Rethinking the Robustness for Open-Vocabulary Action Recognition
by: Cheng, Haozhe, et al.
Published: (2024)

Enhancing Spatio-Temporal Zero-shot Action Recognition with Language-driven Description Attributes
by: Kim, Yehna, et al.
Published: (2025)

Modelling Spatio-Temporal Interactions For Compositional Action Recognition
by: Rajendiran, Ramanathan, et al.
Published: (2023)

Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding
by: Wasim, Syed Talal, et al.
Published: (2023)

Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools
by: Yuan, Zhenlong, et al.
Published: (2025)

Learning to Generalize without Bias for Open-Vocabulary Action Recognition
by: Yu, Yating, et al.
Published: (2025)

Open-Vocabulary Temporal Action Localization using Multimodal Guidance
by: Gupta, Akshita, et al.
Published: (2024)

Exploring Scalability of Self-Training for Open-Vocabulary Temporal Action Localization
by: Hyun, Jeongseok, et al.
Published: (2024)

One-Stage Open-Vocabulary Temporal Action Detection Leveraging Temporal Multi-scale and Action Label Features
by: Nguyen, Trung Thanh, et al.
Published: (2024)

CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation
by: Cho, Seokju, et al.
Published: (2023)

MVAFormer: RGB-based Multi-View Spatio-Temporal Action Recognition with Transformer
by: Yamane, Taiga, et al.
Published: (2025)

Spatio-Temporal Joint Density Driven Learning for Skeleton-Based Action Recognition
by: Gunasekara, Shanaka Ramesh, et al.
Published: (2025)

Exploiting VLM Localizability and Semantics for Open Vocabulary Action Detection
by: Bao, Wentao, et al.
Published: (2024)

Scaling Open-Vocabulary Action Detection
by: Sia, Zhen Hao, et al.
Published: (2025)

FluoCLIP: Stain-Aware Focus Quality Assessment in Fluorescence Microscopy
by: Park, Hyejin, et al.
Published: (2026)

Rethinking CLIP-based Video Learners in Cross-Domain Open-Vocabulary Action Recognition
by: Lin, Kun-Yu, et al.
Published: (2024)

Leveraging Temporal Contextualization for Video Action Recognition
by: Kim, Minji, et al.
Published: (2024)

Stepping Out of Similar Semantic Space for Open-Vocabulary Segmentation
by: Liu, Yong, et al.
Published: (2025)

Denoise and Align: Diffusion-Driven Foreground Knowledge Prompting for Open-Vocabulary Temporal Action Detection
by: Zhu, Sa, et al.
Published: (2026)

MGCA-Net: Multi-Grained Category-Aware Network for Open-Vocabulary Temporal Action Localization
by: Fang, Zhenying, et al.
Published: (2025)

FROSTER: Frozen CLIP Is A Strong Teacher for Open-Vocabulary Action Recognition
by: Huang, Xiaohu, et al.
Published: (2024)

Decompose and Transfer: CoT-Prompting Enhanced Alignment for Open-Vocabulary Temporal Action Detection
by: Zhu, Sa, et al.
Published: (2026)

StretchySnake: Flexible SSM Training Unlocks Action Recognition Across Spatio-Temporal Scales
by: Siddiqui, Nyle, et al.
Published: (2025)

UniSTFormer: Unified Spatio-Temporal Lightweight Transformer for Efficient Skeleton-Based Action Recognition
by: Wu, Wenhan, et al.
Published: (2025)

Object-Centric Open-Vocabulary Image-Retrieval with Aggregated Features
by: Levi, Hila, et al.
Published: (2023)

HERO: Hierarchical Embedding-Refinement for Open-Vocabulary Temporal Sentence Grounding in Videos
by: Han, Tingting, et al.
Published: (2026)

Spatio-Temporal Context Prompting for Zero-Shot Action Detection
by: Huang, Wei-Jhe, et al.
Published: (2024)

A Decoding Scheme with Successive Aggregation of Multi-Level Features for Light-Weight Semantic Segmentation
by: Yoo, Jiwon, et al.
Published: (2024)

D$^2$ST-Adapter: Disentangled-and-Deformable Spatio-Temporal Adapter for Few-shot Action Recognition
by: Pei, Wenjie, et al.
Published: (2023)

DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition
by: Ullah, Hayat, et al.
Published: (2025)

Dynamic Guidance Adversarial Distillation with Enhanced Teacher Knowledge
by: Park, Hyejin, et al.
Published: (2024)

Spatio-Temporal Proximity-Aware Dual-Path Model for Panoramic Activity Recognition
by: Lee, Sumin, et al.
Published: (2024)

OVMR: Open-Vocabulary Recognition with Multi-Modal References
by: Ma, Zehong, et al.
Published: (2024)

SkateFormer: Skeletal-Temporal Transformer for Human Action Recognition
by: Do, Jeonghyeok, et al.
Published: (2024)

Training-free Boost for Open-Vocabulary Object Detection with Confidence Aggregation
by: Zheng, Yanhao, et al.
Published: (2024)