| Main Authors: | Liu, Caihua, Li, Xu, Xue, Wenjing, Tang, Wei, Feng, Xia |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2502.13754 |
Similar Items
Learning Transferable Facial Emotion Representations from Large-Scale Semantically Rich Captions
by: Sun, Licai, et al.
Published: (2025)
Progress-Aware Video Frame Captioning
by: Xue, Zihui, et al.
Published: (2024)
Scene Graph-guided SegCaptioning Transformer with Fine-grained Alignment for Controllable Video Segmentation and Captioning
by: Zhang, Xu, et al.
Published: (2026)
Static-Dynamic Class-level Perception Consistency in Video Semantic Segmentation
by: Cen, Zhigang, et al.
Published: (2024)
Vision-based 3D Semantic Scene Completion via Capture Dynamic Representations
by: Wang, Meng, et al.
Published: (2025)
TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions
by: Yao, Linli, et al.
Published: (2026)
Semantic-Spatial Feature Fusion with Dynamic Graph Refinement for Remote Sensing Image Captioning
by: Liu, Maofu, et al.
Published: (2025)
Explicit Temporal-Semantic Modeling for Dense Video Captioning via Context-Aware Cross-Modal Interaction
by: Jia, Mingda, et al.
Published: (2025)
DEVICE: Depth and Visual Concepts Aware Transformer for OCR-based Image Captioning
by: Xu, Dongsheng, et al.
Published: (2023)
HiSem: Hierarchical Semantic Disentangling for Remote Sensing Image Change Captioning
by: Wang, Man, et al.
Published: (2026)
Set Prediction Guided by Semantic Concepts for Diverse Video Captioning
by: Lu, Yifan, et al.
Published: (2023)
SGCap: Decoding Semantic Group for Zero-shot Video Captioning
by: Pan, Zeyu, et al.
Published: (2025)
Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting
by: Tang, Yunlong, et al.
Published: (2025)
Retrieval-Augmented Egocentric Video Captioning
by: Xu, Jilan, et al.
Published: (2024)
Dense Video Captioning Using Unsupervised Semantic Information
by: Estevam, Valter, et al.
Published: (2021)
Sali4Vid: Saliency-Aware Video Reweighting and Adaptive Caption Retrieval for Dense Video Captioning
by: Jeon, MinJu, et al.
Published: (2025)
GesVLA: Gesture-Aware Vision-Language-Action Model Embedded Representations
by: Guo, Wenxuan, et al.
Published: (2026)
Capturing Context-Aware Route Choice Semantics for Trajectory Representation Learning
by: Cao, Ji, et al.
Published: (2025)
Video-Language Alignment via Spatio-Temporal Graph Transformer
by: Zhang, Shi-Xue, et al.
Published: (2024)
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
by: Chen, Lin, et al.
Published: (2024)
Video Summarization: Towards Entity-Aware Captions
by: Ayyubi, Hammad A., et al.
Published: (2023)
Technical Report for Soccernet 2023 -- Dense Video Captioning
by: Ruan, Zheng, et al.
Published: (2024)
A Semantic and Motion-Aware Spatiotemporal Transformer Network for Action Detection
by: Korban, Matthew, et al.
Published: (2024)
UltraVideo: High-Quality UHD Video Dataset with Comprehensive Captions
by: Xue, Zhucun, et al.
Published: (2025)
HabitAction: A Video Dataset for Human Habitual Behavior Recognition
by: Li, Hongwu, et al.
Published: (2024)
SAVE: Speech-Aware Video Representation Learning for Video-Text Retrieval
by: Zhao, Ruixiang, et al.
Published: (2026)
Chirality in Action: Time-Aware Video Representation Learning by Latent Straightening
by: Bagad, Piyush, et al.
Published: (2025)
SemanticDialect: Semantic-Aware Mixed-Format Quantization for Video Diffusion Transformers
by: Jang, Wonsuk, et al.
Published: (2026)
Addressing the ID-Matching Challenge in Long Video Captioning
by: Yang, Zhantao, et al.
Published: (2025)
From Abstraction to Instantiation: Learning Behavioral Representation for Vision-Language-Action Model
by: Hu, Bing, et al.
Published: (2026)
Dense Video Captioning using Graph-based Sentence Summarization
by: Zhang, Zhiwang, et al.
Published: (2025)
Modeling Multiple Normal Action Representations for Error Detection in Procedural Tasks
by: Huang, Wei-Jin, et al.
Published: (2025)
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks
by: Wu, Peiran, et al.
Published: (2025)
GLaVE-Cap: Global-Local Aligned Video Captioning with Vision Expert Integration
by: Xu, Wan, et al.
Published: (2025)
Patch as Node: Human-Centric Graph Representation Learning for Multimodal Action Recognition
by: Liang, Zeyu, et al.
Published: (2025)
HAFormer: Unleashing the Power of Hierarchy-Aware Features for Lightweight Semantic Segmentation
by: Xu, Guoan, et al.
Published: (2024)
HowToCaption: Prompting LLMs to Transform Video Annotations at Scale
by: Shvetsova, Nina, et al.
Published: (2023)
AVC-DPO: Aligned Video Captioning via Direct Preference Optimization
by: Tang, Jiyang, et al.
Published: (2025)
VCapsBench: A Large-scale Fine-grained Benchmark for Video Caption Quality Evaluation
by: Zhang, Shi-Xue, et al.
Published: (2025)
LFS: Learnable Frame Selector for Event-Aware and Temporally Diverse Video Captioning
by: Chao, Lianying, et al.
Published: (2026)