:: Library Catalog

Bibliographic Details
Main Authors:	Tao, Zhuo; Li, Liang; Chen, Qi; Tu, Yunbin; Zha, Zheng-Jun; Yang, Ming-Hsuan; Qi, Yuankai; Huang, Qingming
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2503.17651

Similar Items

Exploring the Temporal Consistency for Point-Level Weakly-Supervised Temporal Action Localization
by: Ma, Yunchuan, et al.
Published: (2026)

Boosting Point-supervised Temporal Action Localization via Text Refinement and Alignment
by: Ma, Yunchuan, et al.
Published: (2026)

Context-aware Difference Distilling for Multi-change Captioning
by: Tu, Yunbin, et al.
Published: (2024)

Query-centric Audio-Visual Cognition Network for Moment Retrieval, Segmentation and Step-Captioning
by: Tu, Yunbin, et al.
Published: (2024)

Distractors-Immune Representation Learning with Cross-modal Contrastive Regularization for Change Captioning
by: Tu, Yunbin, et al.
Published: (2024)

SOVC: Subject-Oriented Video Captioning
by: Teng, Chang, et al.
Published: (2023)

StyleDubber: Towards Multi-Scale Style Learning for Movie Dubbing
by: Cong, Gaoxiang, et al.
Published: (2024)

When the Future Becomes the Past: Taming Temporal Correspondence for Self-supervised Video Representation Learning
by: Liu, Yang, et al.
Published: (2025)

Downstream-Pretext Domain Knowledge Traceback for Active Learning
by: Zhang, Beichen, et al.
Published: (2024)

Self-supervised Representation Learning with Local Aggregation for Image-based Profiling
by: Dai, Siran, et al.
Published: (2025)

MATE: Motion-Augmented Temporal Consistency for Event-based Point Tracking
by: Han, Han, et al.
Published: (2024)

RETTA: Retrieval-Enhanced Test-Time Adaptation for Zero-Shot Video Captioning
by: Ma, Yunchuan, et al.
Published: (2024)

Tracking the Unstable: Appearance-Guided Motion Modeling for Robust Multi-Object Tracking in UAV-Captured Videos
by: Ma, Jianbo, et al.
Published: (2025)

DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency
by: Qi, Mengshi, et al.
Published: (2025)

CoSyncDiT: Cognitive Synchronous Diffusion Transformer for Movie Dubbing
by: Cong, Gaoxiang, et al.
Published: (2026)

Experiences from Benchmarking Vision-Language-Action Models for Robotic Manipulation
by: Zhang, Yihao, et al.
Published: (2025)

The Devil is in the Distributions: Explicit Modeling of Scene Content is Key in Zero-Shot Video Captioning
by: Tian, Mingkai, et al.
Published: (2025)

From Static to Dynamic: Exploring Self-supervised Image-to-Video Representation Transfer Learning
by: Liu, Yang, et al.
Published: (2026)

PTT: Point-Trajectory Transformer for Efficient Temporal 3D Object Detection
by: Huang, Kuan-Chih, et al.
Published: (2023)

Bias-Conflict Sample Synthesis and Adversarial Removal Debias Strategy for Temporal Sentence Grounding in Video
by: Qi, Zhaobo, et al.
Published: (2024)

Uncertainty-boosted Robust Video Activity Anticipation
by: Qi, Zhaobo, et al.
Published: (2024)

SafeCFG: Controlling Harmful Features with Dynamic Safe Guidance for Safe Generation
by: Pan, Jiadong, et al.
Published: (2024)

FlowDubber: Movie Dubbing with LLM-based Semantic-aware Learning and Flow Matching based Voice Enhancing
by: Cong, Gaoxiang, et al.
Published: (2025)

Decorrelating Structure via Adapters Makes Ensemble Learning Practical for Semi-supervised Learning
by: Wu, Jiaqi, et al.
Published: (2024)

EmoDubber: Towards High Quality and Emotion Controllable Movie Dubbing
by: Cong, Gaoxiang, et al.
Published: (2024)

Exploring Primitive Visual Measurement Understanding and the Role of Output Format in Learning in Vision-Language Models
by: Yadav, Ankit, et al.
Published: (2025)

Exploring Structural Degradation in Dense Representations for Self-supervised Learning
by: Dai, Siran, et al.
Published: (2025)

SDVPT: Semantic-Driven Visual Prompt Tuning for Open-World Object Counting
by: Zhao, Yiming, et al.
Published: (2025)

Adapter-Enhanced Semantic Prompting for Continual Learning
by: Yin, Baocai, et al.
Published: (2024)

Point Cloud Mixture-of-Domain-Experts Model for 3D Self-supervised Learning
by: Zha, Yaohua, et al.
Published: (2024)

A Survey on Improving Human Robot Collaboration through Vision-and-Language Navigation
by: Yakolli, Nivedan, et al.
Published: (2025)

Uncertainty-aware Long-tailed Weights Model the Utility of Pseudo-labels for Semi-supervised Learning
by: Wu, Jiaqi, et al.
Published: (2025)

Visual and Semantic Prompt Collaboration for Generalized Zero-Shot Learning
by: Jiang, Huajie, et al.
Published: (2025)

Multi-granularity Contrastive Cross-modal Collaborative Generation for End-to-End Long-term Video Question Answering
by: Yu, Ting, et al.
Published: (2024)

EventMamba: Enhancing Spatio-Temporal Locality with State Space Models for Event-Based Video Reconstruction
by: Ge, Chengjie, et al.
Published: (2025)

Visual-Geometric Collaborative Guidance for Affordance Learning
by: Luo, Hongchen, et al.
Published: (2024)

FastInit: Fast Noise Initialization for Temporally Consistent Video Generation
by: Bai, Chengyu, et al.
Published: (2025)

StruMamba3D: Exploring Structural Mamba for Self-supervised Point Cloud Representation Learning
by: Wang, Chuxin, et al.
Published: (2025)

Teaching Prompts to Coordinate: Hierarchical Layer-Grouped Prompt Tuning for Continual Learning
by: Jiang, Shengqin, et al.
Published: (2025)

Collaborative Attention and Consistent-Guided Fusion of MRI and PET for Alzheimer's Disease Diagnosis
by: Ma, Delin, et al.
Published: (2025)