:: Library Catalog

Bibliographic Details
Main Authors:	Tian, Kaibin; Cheng, Yanhua; Liu, Yi; Hou, Xinglin; Chen, Quan; Li, Han
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2401.00701

Similar Items

Unifying Latent and Lexicon Representations for Effective Video-Text Retrieval
by: Liu, Haowei, et al.
Published: (2024)

Repeating Words for Video-Language Retrieval with Coarse-to-Fine Objectives
by: Zhao, Haoyu, et al.
Published: (2025)

Beyond Coarse-Grained Matching in Video-Text Retrieval
by: Chen, Aozhu, et al.
Published: (2024)

FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression
by: Zhu, Yuke, et al.
Published: (2024)

SAVE: Speech-Aware Video Representation Learning for Video-Text Retrieval
by: Zhao, Ruixiang, et al.
Published: (2026)

Coarse-To-Fine Tensor Trains for Compact Visual Representations
by: Loeschcke, Sebastian, et al.
Published: (2024)

Test-Time Temporal Sampling for Efficient MLLM Video Understanding
by: Wang, Kaibin, et al.
Published: (2025)

Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval
by: Jeong, Boseung, et al.
Published: (2025)

Adversarial Video Promotion Against Text-to-Video Retrieval
by: Tian, Qiwei, et al.
Published: (2025)

Learning Coarse-to-Fine Osteoarthritis Representations under Noisy Hierarchical Labels
by: Zhang, Tongxu
Published: (2026)

Multi-entity Video Transformers for Fine-Grained Video Representation Learning
by: Walmer, Matthew, et al.
Published: (2023)

EagleNet: Energy-Aware Fine-Grained Relationship Learning Network for Text-Video Retrieval
by: Chen, Yuhan, et al.
Published: (2026)

CFSum: A Transformer-Based Multi-Modal Video Summarization Framework With Coarse-Fine Fusion
by: Guo, Yaowei, et al.
Published: (2025)

CoVA: Text-Guided Composed Video Retrieval for Audio-Visual Content
by: Han, Gyuwon, et al.
Published: (2026)

Text-Guided Coarse-to-Fine Fusion Network for Robust Remote Sensing Visual Question Answering
by: Zhao, Zhicheng, et al.
Published: (2024)

RAP: Efficient Text-Video Retrieval with Sparse-and-Correlated Adapter
by: Cao, Meng, et al.
Published: (2024)

Text-Animator: Controllable Visual Text Video Generation
by: Liu, Lin, et al.
Published: (2024)

Ambiguity-Restrained Text-Video Representation Learning for Partially Relevant Video Retrieval
by: Cho, CH, et al.
Published: (2025)

Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations
by: Wang, Yuji, et al.
Published: (2025)

TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs
by: Li, Yunheng, et al.
Published: (2025)

Vision-TTT: Efficient and Expressive Visual Representation Learning with Test-Time Training
by: Kong, Quan, et al.
Published: (2026)

Raccoon: Multi-stage Diffusion Training with Coarse-to-Fine Curating Videos
by: Tan, Zhiyu, et al.
Published: (2025)

T2VIndexer: A Generative Video Indexer for Efficient Text-Video Retrieval
by: Li, Yili, et al.
Published: (2024)

M$^2$IV: Towards Efficient and Fine-grained Multimodal In-Context Learning via Representation Engineering
by: Li, Yanshu, et al.
Published: (2025)

TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval
by: Shen, Leqi, et al.
Published: (2024)

Fine-Grained Spatiotemporal Motion Alignment for Contrastive Video Representation Learning
by: Zhu, Minghao, et al.
Published: (2023)

Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing
by: Cui, Cheng, et al.
Published: (2026)

HVD: Human Vision-Driven Video Representation Learning for Text-Video Retrieval
by: Xie, Zequn, et al.
Published: (2026)

Generative Recall, Dense Reranking: Learning Multi-View Semantic IDs for Efficient Text-to-Video Retrieval
by: Zhao, Zecheng, et al.
Published: (2026)

Multi-modal Reference Learning for Fine-grained Text-to-Image Retrieval
by: Ma, Zehong, et al.
Published: (2025)

Learning Accurate Template Matching with Differentiable Coarse-to-Fine Correspondence Refinement
by: Gao, Zhirui, et al.
Published: (2023)

Zoom-Zero: Reinforced Coarse-to-Fine Video Understanding via Temporal Zoom-in
by: Shen, Xiaoqian, et al.
Published: (2025)

Denoise-then-Retrieve: Text-Conditioned Video Denoising for Video Moment Retrieval
by: Liu, Weijia, et al.
Published: (2025)

DCP-CLIP: A Coarse-to-Fine Framework for Open-Vocabulary Semantic Segmentation with Dual Interaction
by: Wang, Jing, et al.
Published: (2026)

Prompt-aware of Frame Sampling for Efficient Text-Video Retrieval
by: Zhang, Deyu, et al.
Published: (2025)

DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval
by: Shen, Leqi, et al.
Published: (2025)

Towards Scalable Video Anomaly Retrieval: A Synthetic Video-Text Benchmark
by: Yang, Shuyu, et al.
Published: (2025)

MambaScope: Coarse-to-Fine Scoping for Efficient Vision Mamba
by: Liu, Shanhui, et al.
Published: (2025)

InsEdit: Towards Instruction-based Visual Editing via Data-Efficient Video Diffusion Models Adaptation
by: Rao, Zhefan, et al.
Published: (2026)

Dyn-Adapter: Towards Disentangled Representation for Efficient Visual Recognition
by: Zhang, Yurong, et al.
Published: (2024)