Library Catalog

Bibliographic Details
Main Authors:	Xu, Yifan; Li, Xinhao; Yang, Yichun; Meng, Desen; Huang, Rui; Wang, Limin
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Information Retrieval; Machine Learning
Online Access:	https://arxiv.org/abs/2501.00513

Similar Items

Beyond Global Similarity: Towards Fine-Grained, Multi-Condition Multimodal Retrieval
by: Lu, Xuan, et al.
Published: (2026)

Video Enriched Retrieval Augmented Generation Using Aligned Video Captions
by: Rosa, Kevin Dela
Published: (2024)

LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts
by: Cai, Qifeng, et al.
Published: (2025)

VideoCap-R1: Enhancing MLLMs for Video Captioning via Structured Thinking
by: Meng, Desen, et al.
Published: (2025)

Rethinking Composed Image Retrieval Evaluation: A Fine-Grained Benchmark from Image Editing
by: Song, Tingyu, et al.
Published: (2026)

TIGER-FG: Text-Guided Implicit Fine-Grained Grounding for E-commerce Retrieval
by: Sun, Xinyu, et al.
Published: (2026)

Chain-of-Thought Re-ranking for Image Retrieval Tasks
by: Wu, Shangrong, et al.
Published: (2025)

Attributes Grouping and Mining Hashing for Fine-Grained Image Retrieval
by: Lu, Xin, et al.
Published: (2023)

Video Editing for Video Retrieval
by: Zhu, Bin, et al.
Published: (2024)

Visual Lifelog Retrieval through Captioning-Enhanced Interpretation
by: Shih, Yu-Fei, et al.
Published: (2025)

ITSELF: Attention Guided Fine-Grained Alignment for Vision-Language Retrieval
by: Nguyen, Tien-Huy, et al.
Published: (2026)

LongVidSearch: An Agentic Benchmark for Multi-hop Evidence Retrieval Planning in Long Videos
by: Yu, Rongyi, et al.
Published: (2026)

Retrieval-Guided Generation for Safer Histopathology Image Captioning
by: Hoq, Md. Enamul, et al.
Published: (2026)

MR$^2$-Bench: Going Beyond Matching to Reasoning in Multimodal Retrieval
by: Zhou, Junjie, et al.
Published: (2025)

PHPQ: Pyramid Hybrid Pooling Quantization for Efficient Fine-Grained Image Retrieval
by: Zeng, Ziyun, et al.
Published: (2021)

Video-ColBERT: Contextualized Late Interaction for Text-to-Video Retrieval
by: Reddy, Arun, et al.
Published: (2025)

VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos
by: Ren, Xubin, et al.
Published: (2025)

MSAM: Multi-Semantic Adaptive Mining for Cross-Modal Drone Video-Text Retrieval
by: Huang, Jinghao, et al.
Published: (2025)

SHE-Net: Syntax-Hierarchy-Enhanced Text-Video Retrieval
by: Yu, Xuzheng, et al.
Published: (2024)

Adapting MLLMs for Nuanced Video Retrieval
by: Bagad, Piyush, et al.
Published: (2025)

RANKVIDEO: Reasoning Reranking for Text-to-Video Retrieval
by: Skow, Tyler, et al.
Published: (2026)

Event-aware Video Corpus Moment Retrieval
by: Hou, Danyang, et al.
Published: (2024)

Interactive Multi-Turn Retrieval for Health Videos
by: Wu, Chengzheng, et al.
Published: (2026)

Dual Prompt Learning for Adapting Vision-Language Models to Downstream Image-Text Retrieval
by: Wang, Yifan, et al.
Published: (2025)

OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations
by: Ouyang, Linke, et al.
Published: (2024)

DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories
by: Deng, Chenlong, et al.
Published: (2026)

Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement
by: Hou, Danyang, et al.
Published: (2024)

MLLM-Guided VLM Fine-Tuning with Joint Inference for Zero-Shot Composed Image Retrieval
by: Tu, Rong-Cheng, et al.
Published: (2025)

Global-to-Local or Local-to-Global? Enhancing Image Retrieval with Efficient Local Search and Effective Global Re-ranking
by: Aiger, Dror, et al.
Published: (2025)

REAL-MM-RAG: A Real-World Multi-Modal Retrieval Benchmark
by: Wasserman, Navve, et al.
Published: (2025)

MARQUIS: A Three-Stage Pipeline for Video Retrieval-Augmented Generation
by: Chakraborty, Debashish, et al.
Published: (2026)

EvdCLIP: Improving Vision-Language Retrieval with Entity Visual Descriptions from Large Language Models
by: Meng, GuangHao, et al.
Published: (2025)

Fine-grained Motion Retrieval via Joint-Angle Motion Images and Token-Patch Late Interaction
by: Zhang, Yao, et al.
Published: (2026)

Re-ranking the Context for Multimodal Retrieval Augmented Generation
by: Mortaheb, Matin, et al.
Published: (2025)

Attribute-Guided Multi-Level Attention Network for Fine-Grained Fashion Retrieval
by: Xiao, Ling, et al.
Published: (2022)

VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models
by: Kim, Dahun, et al.
Published: (2025)

Benchmarking Retrieval-Augmented Multimodal Generation for Document Question Answering
by: Dong, Kuicai, et al.
Published: (2025)

Enhancing Subsequent Video Retrieval via Vision-Language Models (VLMs)
by: Duan, Yicheng, et al.
Published: (2025)

PhotoBench: Beyond Visual Matching Towards Personalized Intent-Driven Photo Retrieval
by: Xu, Tianyi, et al.
Published: (2026)

Fetch-A-Set: A Large-Scale OCR-Free Benchmark for Historical Document Retrieval
by: Molina, Adrià, et al.
Published: (2024)