| Main Authors: | Xu, Yifan; Li, Xinhao; Yang, Yichun; Meng, Desen; Huang, Rui; Wang, Limin |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2501.00513 |
Similar Items
Beyond Global Similarity: Towards Fine-Grained, Multi-Condition Multimodal Retrieval
by: Lu, Xuan, et al.
Published: (2026)
Video Enriched Retrieval Augmented Generation Using Aligned Video Captions
by: Rosa, Kevin Dela
Published: (2024)
LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts
by: Cai, Qifeng, et al.
Published: (2025)
VideoCap-R1: Enhancing MLLMs for Video Captioning via Structured Thinking
by: Meng, Desen, et al.
Published: (2025)
Rethinking Composed Image Retrieval Evaluation: A Fine-Grained Benchmark from Image Editing
by: Song, Tingyu, et al.
Published: (2026)
TIGER-FG: Text-Guided Implicit Fine-Grained Grounding for E-commerce Retrieval
by: Sun, Xinyu, et al.
Published: (2026)
Chain-of-Thought Re-ranking for Image Retrieval Tasks
by: Wu, Shangrong, et al.
Published: (2025)
Attributes Grouping and Mining Hashing for Fine-Grained Image Retrieval
by: Lu, Xin, et al.
Published: (2023)
Video Editing for Video Retrieval
by: Zhu, Bin, et al.
Published: (2024)
Visual Lifelog Retrieval through Captioning-Enhanced Interpretation
by: Shih, Yu-Fei, et al.
Published: (2025)
ITSELF: Attention Guided Fine-Grained Alignment for Vision-Language Retrieval
by: Nguyen, Tien-Huy, et al.
Published: (2026)
LongVidSearch: An Agentic Benchmark for Multi-hop Evidence Retrieval Planning in Long Videos
by: Yu, Rongyi, et al.
Published: (2026)
Retrieval-Guided Generation for Safer Histopathology Image Captioning
by: Hoq, Md. Enamul, et al.
Published: (2026)
MR$^2$-Bench: Going Beyond Matching to Reasoning in Multimodal Retrieval
by: Zhou, Junjie, et al.
Published: (2025)
PHPQ: Pyramid Hybrid Pooling Quantization for Efficient Fine-Grained Image Retrieval
by: Zeng, Ziyun, et al.
Published: (2021)
Video-ColBERT: Contextualized Late Interaction for Text-to-Video Retrieval
by: Reddy, Arun, et al.
Published: (2025)
VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos
by: Ren, Xubin, et al.
Published: (2025)
MSAM: Multi-Semantic Adaptive Mining for Cross-Modal Drone Video-Text Retrieval
by: Huang, Jinghao, et al.
Published: (2025)
SHE-Net: Syntax-Hierarchy-Enhanced Text-Video Retrieval
by: Yu, Xuzheng, et al.
Published: (2024)
Adapting MLLMs for Nuanced Video Retrieval
by: Bagad, Piyush, et al.
Published: (2025)
RANKVIDEO: Reasoning Reranking for Text-to-Video Retrieval
by: Skow, Tyler, et al.
Published: (2026)
Event-aware Video Corpus Moment Retrieval
by: Hou, Danyang, et al.
Published: (2024)
Interactive Multi-Turn Retrieval for Health Videos
by: Wu, Chengzheng, et al.
Published: (2026)
Dual Prompt Learning for Adapting Vision-Language Models to Downstream Image-Text Retrieval
by: Wang, Yifan, et al.
Published: (2025)
OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations
by: Ouyang, Linke, et al.
Published: (2024)
DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories
by: Deng, Chenlong, et al.
Published: (2026)
Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement
by: Hou, Danyang, et al.
Published: (2024)
MLLM-Guided VLM Fine-Tuning with Joint Inference for Zero-Shot Composed Image Retrieval
by: Tu, Rong-Cheng, et al.
Published: (2025)
Global-to-Local or Local-to-Global? Enhancing Image Retrieval with Efficient Local Search and Effective Global Re-ranking
by: Aiger, Dror, et al.
Published: (2025)
REAL-MM-RAG: A Real-World Multi-Modal Retrieval Benchmark
by: Wasserman, Navve, et al.
Published: (2025)
MARQUIS: A Three-Stage Pipeline for Video Retrieval-Augmented Generation
by: Chakraborty, Debashish, et al.
Published: (2026)
EvdCLIP: Improving Vision-Language Retrieval with Entity Visual Descriptions from Large Language Models
by: Meng, GuangHao, et al.
Published: (2025)
Fine-grained Motion Retrieval via Joint-Angle Motion Images and Token-Patch Late Interaction
by: Zhang, Yao, et al.
Published: (2026)
Re-ranking the Context for Multimodal Retrieval Augmented Generation
by: Mortaheb, Matin, et al.
Published: (2025)
Attribute-Guided Multi-Level Attention Network for Fine-Grained Fashion Retrieval
by: Xiao, Ling, et al.
Published: (2022)
VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models
by: Kim, Dahun, et al.
Published: (2025)
Benchmarking Retrieval-Augmented Multimodal Generation for Document Question Answering
by: Dong, Kuicai, et al.
Published: (2025)
Enhancing Subsequent Video Retrieval via Vision-Language Models (VLMs)
by: Duan, Yicheng, et al.
Published: (2025)
PhotoBench: Beyond Visual Matching Towards Personalized Intent-Driven Photo Retrieval
by: Xu, Tianyi, et al.
Published: (2026)
Fetch-A-Set: A Large-Scale OCR-Free Benchmark for Historical Document Retrieval
by: Molina, Adrià, et al.
Published: (2024)