:: Library Catalog

Bibliographic Details
Main Authors:	Liu, Caihua; Li, Xu; Xue, Wenjing; Tang, Wei; Feng, Xia
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2502.13754

Similar Items

Learning Transferable Facial Emotion Representations from Large-Scale Semantically Rich Captions
by: Sun, Licai, et al.
Published: (2025)

Progress-Aware Video Frame Captioning
by: Xue, Zihui, et al.
Published: (2024)

Scene Graph-guided SegCaptioning Transformer with Fine-grained Alignment for Controllable Video Segmentation and Captioning
by: Zhang, Xu, et al.
Published: (2026)

Static-Dynamic Class-level Perception Consistency in Video Semantic Segmentation
by: Cen, Zhigang, et al.
Published: (2024)

Vision-based 3D Semantic Scene Completion via Capture Dynamic Representations
by: Wang, Meng, et al.
Published: (2025)

TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions
by: Yao, Linli, et al.
Published: (2026)

Semantic-Spatial Feature Fusion with Dynamic Graph Refinement for Remote Sensing Image Captioning
by: Liu, Maofu, et al.
Published: (2025)

Explicit Temporal-Semantic Modeling for Dense Video Captioning via Context-Aware Cross-Modal Interaction
by: Jia, Mingda, et al.
Published: (2025)

DEVICE: Depth and Visual Concepts Aware Transformer for OCR-based Image Captioning
by: Xu, Dongsheng, et al.
Published: (2023)

HiSem: Hierarchical Semantic Disentangling for Remote Sensing Image Change Captioning
by: Wang, Man, et al.
Published: (2026)

Set Prediction Guided by Semantic Concepts for Diverse Video Captioning
by: Lu, Yifan, et al.
Published: (2023)

SGCap: Decoding Semantic Group for Zero-shot Video Captioning
by: Pan, Zeyu, et al.
Published: (2025)

Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting
by: Tang, Yunlong, et al.
Published: (2025)

Retrieval-Augmented Egocentric Video Captioning
by: Xu, Jilan, et al.
Published: (2024)

Dense Video Captioning Using Unsupervised Semantic Information
by: Estevam, Valter, et al.
Published: (2021)

Sali4Vid: Saliency-Aware Video Reweighting and Adaptive Caption Retrieval for Dense Video Captioning
by: Jeon, MinJu, et al.
Published: (2025)

GesVLA: Gesture-Aware Vision-Language-Action Model Embedded Representations
by: Guo, Wenxuan, et al.
Published: (2026)

Capturing Context-Aware Route Choice Semantics for Trajectory Representation Learning
by: Cao, Ji, et al.
Published: (2025)

Video-Language Alignment via Spatio-Temporal Graph Transformer
by: Zhang, Shi-Xue, et al.
Published: (2024)

ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
by: Chen, Lin, et al.
Published: (2024)

Video Summarization: Towards Entity-Aware Captions
by: Ayyubi, Hammad A., et al.
Published: (2023)

Technical Report for Soccernet 2023 -- Dense Video Captioning
by: Ruan, Zheng, et al.
Published: (2024)

A Semantic and Motion-Aware Spatiotemporal Transformer Network for Action Detection
by: Korban, Matthew, et al.
Published: (2024)

UltraVideo: High-Quality UHD Video Dataset with Comprehensive Captions
by: Xue, Zhucun, et al.
Published: (2025)

HabitAction: A Video Dataset for Human Habitual Behavior Recognition
by: Li, Hongwu, et al.
Published: (2024)

SAVE: Speech-Aware Video Representation Learning for Video-Text Retrieval
by: Zhao, Ruixiang, et al.
Published: (2026)

Chirality in Action: Time-Aware Video Representation Learning by Latent Straightening
by: Bagad, Piyush, et al.
Published: (2025)

SemanticDialect: Semantic-Aware Mixed-Format Quantization for Video Diffusion Transformers
by: Jang, Wonsuk, et al.
Published: (2026)

Addressing the ID-Matching Challenge in Long Video Captioning
by: Yang, Zhantao, et al.
Published: (2025)

From Abstraction to Instantiation: Learning Behavioral Representation for Vision-Language-Action Model
by: Hu, Bing, et al.
Published: (2026)

Dense Video Captioning using Graph-based Sentence Summarization
by: Zhang, Zhiwang, et al.
Published: (2025)

Modeling Multiple Normal Action Representations for Error Detection in Procedural Tasks
by: Huang, Wei-Jin, et al.
Published: (2025)

UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks
by: Wu, Peiran, et al.
Published: (2025)

GLaVE-Cap: Global-Local Aligned Video Captioning with Vision Expert Integration
by: Xu, Wan, et al.
Published: (2025)

Patch as Node: Human-Centric Graph Representation Learning for Multimodal Action Recognition
by: Liang, Zeyu, et al.
Published: (2025)

HAFormer: Unleashing the Power of Hierarchy-Aware Features for Lightweight Semantic Segmentation
by: Xu, Guoan, et al.
Published: (2024)

HowToCaption: Prompting LLMs to Transform Video Annotations at Scale
by: Shvetsova, Nina, et al.
Published: (2023)

AVC-DPO: Aligned Video Captioning via Direct Preference Optimization
by: Tang, Jiyang, et al.
Published: (2025)

VCapsBench: A Large-scale Fine-grained Benchmark for Video Caption Quality Evaluation
by: Zhang, Shi-Xue, et al.
Published: (2025)

LFS: Learnable Frame Selector for Event-Aware and Temporally Diverse Video Captioning
by: Chao, Lianying, et al.
Published: (2026)