Library Catalog

Bibliographic Details
Main Author:	Xu, Tao
Format:	Preprint
Published:	2026
Subjects:	Computation and Language; Computer Vision and Pattern Recognition; Information Retrieval
Online Access:	https://arxiv.org/abs/2602.14162

Similar Items

Retrieval-Augmented Visual Question Answering via Built-in Autoregressive Search Engines
by: Long, Xinwei, et al.
Published: (2025)

WikiSeeker: Rethinking the Role of Vision-Language Models in Knowledge-Based Visual Question Answering
by: Zhu, Yingjian, et al.
Published: (2026)

Benchmarking Retrieval-Augmented Multimodal Generation for Document Question Answering
by: Dong, Kuicai, et al.
Published: (2025)

FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts
by: Singh, Shubhankar, et al.
Published: (2024)

Visual Late Chunking: An Empirical Study of Contextual Chunking for Efficient Visual Document Retrieval
by: Yan, Yibo, et al.
Published: (2026)

Attention Grounded Enhancement for Visual Document Retrieval
by: Cui, Wanqing, et al.
Published: (2025)

A Comprehensive Survey of Knowledge-Based Vision Question Answering Systems: The Lifecycle of Knowledge in Visual Reasoning Task
by: Deng, Jiaqi, et al.
Published: (2025)

Modality-Aware Integration with Large Language Models for Knowledge-based Visual Question Answering
by: Dong, Junnan, et al.
Published: (2024)

Multimedia-Aware Question Answering: A Review of Retrieval and Cross-Modal Reasoning Architectures
by: Raja, Rahul, et al.
Published: (2025)

Structural Anchor Pruning: Training-Free Multi-Vector Compression for Visual Document Retrieval
by: Liu, Zhuchenyang, et al.
Published: (2026)

ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents
by: Wang, Qiuchen, et al.
Published: (2025)

Sculpting the Vector Space: Towards Efficient Multi-Vector Visual Document Retrieval via Prune-then-Merge Framework
by: Yan, Yibo, et al.
Published: (2026)

Personal Visual Memory from Explicit and Implicit Evidence
by: Nguyen, Viet, et al.
Published: (2026)

MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory
by: Guo, Minghao, et al.
Published: (2026)

VISTA: Visualized Text Embedding For Universal Multi-Modal Retrieval
by: Zhou, Junjie, et al.
Published: (2024)

VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents
by: Tanaka, Ryota, et al.
Published: (2025)

TabRAG: Improving Tabular Document Question Answering for Retrieval Augmented Generation via Structured Representations
by: Si, Jacob, et al.
Published: (2025)

Visual Lifelog Retrieval through Captioning-Enhanced Interpretation
by: Shih, Yu-Fei, et al.
Published: (2025)

Multi-Vector Index Compression in Any Modality
by: Qin, Hanxiang, et al.
Published: (2026)

One Pic is All it Takes: Poisoning Visual Document Retrieval Augmented Generation with a Single Image
by: Shereen, Ezzeldin, et al.
Published: (2025)

Indexing Multimodal Language Models for Large-scale Image Retrieval
by: Tharwat, Bahey, et al.
Published: (2026)

A Multi-Granularity Retrieval Framework for Visually-Rich Documents
by: Xu, Mingjun, et al.
Published: (2025)

Learning Visual Composition through Improved Semantic Guidance
by: Stone, Austin, et al.
Published: (2024)

FM2DS: Few-Shot Multimodal Multihop Data Synthesis with Knowledge Distillation for Question Answering
by: Abaskohi, Amirhossein, et al.
Published: (2024)

Are We on the Right Way for Assessing Document Retrieval-Augmented Generation?
by: Shen, Wenxuan, et al.
Published: (2025)

ColPali: Efficient Document Retrieval with Vision Language Models
by: Faysse, Manuel, et al.
Published: (2024)

LITTA: Late-Interaction and Test-Time Alignment for Visually-Grounded Multimodal Retrieval
by: Kim, Seonok
Published: (2026)

Chain of Evidence: Pixel-Level Visual Attribution for Iterative Retrieval-Augmented Generation
by: Liu, Peiyang, et al.
Published: (2026)

TransientTables: Evaluating LLMs' Reasoning on Temporally Evolving Semi-structured Tables
by: Shankarampeta, Abhilash, et al.
Published: (2025)

Reverse Region-to-Entity Annotation for Pixel-Level Visual Entity Linking
by: Xu, Zhengfei, et al.
Published: (2024)

Towards Natural Language-Based Document Image Retrieval: New Dataset and Benchmark
by: Guo, Hao, et al.
Published: (2025)

Enabling Collaborative Parametric Knowledge Calibration for Retrieval-Augmented Vision Question Answering
by: Deng, Jiaqi, et al.
Published: (2025)

Think When Needed: Adaptive Reasoning-Driven Multimodal Embeddings with a Dual-LoRA Architecture
by: Zhang, Longxiang, et al.
Published: (2026)

Improving Applicability of Deep Learning based Token Classification models during Training
by: Mehra, Anket, et al.
Published: (2025)

Smart Multi-Modal Search: Contextual Sparse and Dense Embedding Integration in Adobe Express
by: Aroraa, Cherag, et al.
Published: (2024)

Rethinking Detection Based Table Structure Recognition for Visually Rich Document Images
by: Xiao, Bin, et al.
Published: (2023)

ViDR: Grounding Multimodal Deep Research Reports in Source Visual Evidence
by: Shi, Zhuofan, et al.
Published: (2026)

VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents
by: Yu, Shi, et al.
Published: (2024)

DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories
by: Deng, Chenlong, et al.
Published: (2026)

From Videos to Indexed Knowledge Graphs -- Framework to Marry Methods for Multimodal Content Analysis and Understanding
by: Rizk, Basem, et al.
Published: (2025)