| Main Authors: | Tao, Zhuo, Li, Liang, Chen, Qi, Tu, Yunbin, Zha, Zheng-Jun, Yang, Ming-Hsuan, Qi, Yuankai, Huang, Qingming |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2503.17651 |
Similar Items
Exploring the Temporal Consistency for Point-Level Weakly-Supervised Temporal Action Localization
by: Ma, Yunchuan, et al.
Published: (2026)
Boosting Point-supervised Temporal Action Localization via Text Refinement and Alignment
by: Ma, Yunchuan, et al.
Published: (2026)
Context-aware Difference Distilling for Multi-change Captioning
by: Tu, Yunbin, et al.
Published: (2024)
Query-centric Audio-Visual Cognition Network for Moment Retrieval, Segmentation and Step-Captioning
by: Tu, Yunbin, et al.
Published: (2024)
Distractors-Immune Representation Learning with Cross-modal Contrastive Regularization for Change Captioning
by: Tu, Yunbin, et al.
Published: (2024)
SOVC: Subject-Oriented Video Captioning
by: Teng, Chang, et al.
Published: (2023)
StyleDubber: Towards Multi-Scale Style Learning for Movie Dubbing
by: Cong, Gaoxiang, et al.
Published: (2024)
When the Future Becomes the Past: Taming Temporal Correspondence for Self-supervised Video Representation Learning
by: Liu, Yang, et al.
Published: (2025)
Downstream-Pretext Domain Knowledge Traceback for Active Learning
by: Zhang, Beichen, et al.
Published: (2024)
Self-supervised Representation Learning with Local Aggregation for Image-based Profiling
by: Dai, Siran, et al.
Published: (2025)
MATE: Motion-Augmented Temporal Consistency for Event-based Point Tracking
by: Han, Han, et al.
Published: (2024)
RETTA: Retrieval-Enhanced Test-Time Adaptation for Zero-Shot Video Captioning
by: Ma, Yunchuan, et al.
Published: (2024)
Tracking the Unstable: Appearance-Guided Motion Modeling for Robust Multi-Object Tracking in UAV-Captured Videos
by: Ma, Jianbo, et al.
Published: (2025)
DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency
by: Qi, Mengshi, et al.
Published: (2025)
CoSyncDiT: Cognitive Synchronous Diffusion Transformer for Movie Dubbing
by: Cong, Gaoxiang, et al.
Published: (2026)
Experiences from Benchmarking Vision-Language-Action Models for Robotic Manipulation
by: Zhang, Yihao, et al.
Published: (2025)
The Devil is in the Distributions: Explicit Modeling of Scene Content is Key in Zero-Shot Video Captioning
by: Tian, Mingkai, et al.
Published: (2025)
From Static to Dynamic: Exploring Self-supervised Image-to-Video Representation Transfer Learning
by: Liu, Yang, et al.
Published: (2026)
PTT: Point-Trajectory Transformer for Efficient Temporal 3D Object Detection
by: Huang, Kuan-Chih, et al.
Published: (2023)
Bias-Conflict Sample Synthesis and Adversarial Removal Debias Strategy for Temporal Sentence Grounding in Video
by: Qi, Zhaobo, et al.
Published: (2024)
Uncertainty-boosted Robust Video Activity Anticipation
by: Qi, Zhaobo, et al.
Published: (2024)
SafeCFG: Controlling Harmful Features with Dynamic Safe Guidance for Safe Generation
by: Pan, Jiadong, et al.
Published: (2024)
FlowDubber: Movie Dubbing with LLM-based Semantic-aware Learning and Flow Matching based Voice Enhancing
by: Cong, Gaoxiang, et al.
Published: (2025)
Decorrelating Structure via Adapters Makes Ensemble Learning Practical for Semi-supervised Learning
by: Wu, Jiaqi, et al.
Published: (2024)
EmoDubber: Towards High Quality and Emotion Controllable Movie Dubbing
by: Cong, Gaoxiang, et al.
Published: (2024)
Exploring Primitive Visual Measurement Understanding and the Role of Output Format in Learning in Vision-Language Models
by: Yadav, Ankit, et al.
Published: (2025)
Exploring Structural Degradation in Dense Representations for Self-supervised Learning
by: Dai, Siran, et al.
Published: (2025)
SDVPT: Semantic-Driven Visual Prompt Tuning for Open-World Object Counting
by: Zhao, Yiming, et al.
Published: (2025)
Adapter-Enhanced Semantic Prompting for Continual Learning
by: Yin, Baocai, et al.
Published: (2024)
Point Cloud Mixture-of-Domain-Experts Model for 3D Self-supervised Learning
by: Zha, Yaohua, et al.
Published: (2024)
A Survey on Improving Human Robot Collaboration through Vision-and-Language Navigation
by: Yakolli, Nivedan, et al.
Published: (2025)
Uncertainty-aware Long-tailed Weights Model the Utility of Pseudo-labels for Semi-supervised Learning
by: Wu, Jiaqi, et al.
Published: (2025)
Visual and Semantic Prompt Collaboration for Generalized Zero-Shot Learning
by: Jiang, Huajie, et al.
Published: (2025)
Multi-granularity Contrastive Cross-modal Collaborative Generation for End-to-End Long-term Video Question Answering
by: Yu, Ting, et al.
Published: (2024)
EventMamba: Enhancing Spatio-Temporal Locality with State Space Models for Event-Based Video Reconstruction
by: Ge, Chengjie, et al.
Published: (2025)
Visual-Geometric Collaborative Guidance for Affordance Learning
by: Luo, Hongchen, et al.
Published: (2024)
FastInit: Fast Noise Initialization for Temporally Consistent Video Generation
by: Bai, Chengyu, et al.
Published: (2025)
StruMamba3D: Exploring Structural Mamba for Self-supervised Point Cloud Representation Learning
by: Wang, Chuxin, et al.
Published: (2025)
Teaching Prompts to Coordinate: Hierarchical Layer-Grouped Prompt Tuning for Continual Learning
by: Jiang, Shengqin, et al.
Published: (2025)
Collaborative Attention and Consistent-Guided Fusion of MRI and PET for Alzheimer's Disease Diagnosis
by: Ma, Delin, et al.
Published: (2025)