| Main Authors: | Tian, Kaibin; Cheng, Yanhua; Liu, Yi; Hou, Xinglin; Chen, Quan; Li, Han |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2401.00701 |
Similar Items
Unifying Latent and Lexicon Representations for Effective Video-Text Retrieval
by: Liu, Haowei, et al.
Published: (2024)
Repeating Words for Video-Language Retrieval with Coarse-to-Fine Objectives
by: Zhao, Haoyu, et al.
Published: (2025)
Beyond Coarse-Grained Matching in Video-Text Retrieval
by: Chen, Aozhu, et al.
Published: (2024)
FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression
by: Zhu, Yuke, et al.
Published: (2024)
SAVE: Speech-Aware Video Representation Learning for Video-Text Retrieval
by: Zhao, Ruixiang, et al.
Published: (2026)
Coarse-To-Fine Tensor Trains for Compact Visual Representations
by: Loeschcke, Sebastian, et al.
Published: (2024)
Test-Time Temporal Sampling for Efficient MLLM Video Understanding
by: Wang, Kaibin, et al.
Published: (2025)
Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval
by: Jeong, Boseung, et al.
Published: (2025)
Adversarial Video Promotion Against Text-to-Video Retrieval
by: Tian, Qiwei, et al.
Published: (2025)
Learning Coarse-to-Fine Osteoarthritis Representations under Noisy Hierarchical Labels
by: Zhang, Tongxu
Published: (2026)
Multi-entity Video Transformers for Fine-Grained Video Representation Learning
by: Walmer, Matthew, et al.
Published: (2023)
EagleNet: Energy-Aware Fine-Grained Relationship Learning Network for Text-Video Retrieval
by: Chen, Yuhan, et al.
Published: (2026)
CFSum: A Transformer-Based Multi-Modal Video Summarization Framework With Coarse-Fine Fusion
by: Guo, Yaowei, et al.
Published: (2025)
CoVA: Text-Guided Composed Video Retrieval for Audio-Visual Content
by: Han, Gyuwon, et al.
Published: (2026)
Text-Guided Coarse-to-Fine Fusion Network for Robust Remote Sensing Visual Question Answering
by: Zhao, Zhicheng, et al.
Published: (2024)
RAP: Efficient Text-Video Retrieval with Sparse-and-Correlated Adapter
by: Cao, Meng, et al.
Published: (2024)
Text-Animator: Controllable Visual Text Video Generation
by: Liu, Lin, et al.
Published: (2024)
Ambiguity-Restrained Text-Video Representation Learning for Partially Relevant Video Retrieval
by: Cho, CH, et al.
Published: (2025)
Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations
by: Wang, Yuji, et al.
Published: (2025)
TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs
by: Li, Yunheng, et al.
Published: (2025)
Vision-TTT: Efficient and Expressive Visual Representation Learning with Test-Time Training
by: Kong, Quan, et al.
Published: (2026)
Raccoon: Multi-stage Diffusion Training with Coarse-to-Fine Curating Videos
by: Tan, Zhiyu, et al.
Published: (2025)
T2VIndexer: A Generative Video Indexer for Efficient Text-Video Retrieval
by: Li, Yili, et al.
Published: (2024)
M$^2$IV: Towards Efficient and Fine-grained Multimodal In-Context Learning via Representation Engineering
by: Li, Yanshu, et al.
Published: (2025)
TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval
by: Shen, Leqi, et al.
Published: (2024)
Fine-Grained Spatiotemporal Motion Alignment for Contrastive Video Representation Learning
by: Zhu, Minghao, et al.
Published: (2023)
Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing
by: Cui, Cheng, et al.
Published: (2026)
HVD: Human Vision-Driven Video Representation Learning for Text-Video Retrieval
by: Xie, Zequn, et al.
Published: (2026)
Generative Recall, Dense Reranking: Learning Multi-View Semantic IDs for Efficient Text-to-Video Retrieval
by: Zhao, Zecheng, et al.
Published: (2026)
Multi-modal Reference Learning for Fine-grained Text-to-Image Retrieval
by: Ma, Zehong, et al.
Published: (2025)
Learning Accurate Template Matching with Differentiable Coarse-to-Fine Correspondence Refinement
by: Gao, Zhirui, et al.
Published: (2023)
Zoom-Zero: Reinforced Coarse-to-Fine Video Understanding via Temporal Zoom-in
by: Shen, Xiaoqian, et al.
Published: (2025)
Denoise-then-Retrieve: Text-Conditioned Video Denoising for Video Moment Retrieval
by: Liu, Weijia, et al.
Published: (2025)
DCP-CLIP:A Coarse-to-Fine Framework for Open-Vocabulary Semantic Segmentation with Dual Interaction
by: Wang, Jing, et al.
Published: (2026)
Prompt-aware of Frame Sampling for Efficient Text-Video Retrieval
by: Zhang, Deyu, et al.
Published: (2025)
DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval
by: Shen, Leqi, et al.
Published: (2025)
Towards Scalable Video Anomaly Retrieval: A Synthetic Video-Text Benchmark
by: Yang, Shuyu, et al.
Published: (2025)
MambaScope: Coarse-to-Fine Scoping for Efficient Vision Mamba
by: Liu, Shanhui, et al.
Published: (2025)
InsEdit: Towards Instruction-based Visual Editing via Data-Efficient Video Diffusion Models Adaptation
by: Rao, Zhefan, et al.
Published: (2026)
Dyn-Adapter: Towards Disentangled Representation for Efficient Visual Recognition
by: Zhang, Yurong, et al.
Published: (2024)