| Main Author: | Xu, Tao |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2602.14162 |
Similar Items
Retrieval-Augmented Visual Question Answering via Built-in Autoregressive Search Engines
by: Long, Xinwei, et al.
Published: (2025)
WikiSeeker: Rethinking the Role of Vision-Language Models in Knowledge-Based Visual Question Answering
by: Zhu, Yingjian, et al.
Published: (2026)
Benchmarking Retrieval-Augmented Multimodal Generation for Document Question Answering
by: Dong, Kuicai, et al.
Published: (2025)
FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts
by: Singh, Shubhankar, et al.
Published: (2024)
Visual Late Chunking: An Empirical Study of Contextual Chunking for Efficient Visual Document Retrieval
by: Yan, Yibo, et al.
Published: (2026)
Attention Grounded Enhancement for Visual Document Retrieval
by: Cui, Wanqing, et al.
Published: (2025)
A Comprehensive Survey of Knowledge-Based Vision Question Answering Systems: The Lifecycle of Knowledge in Visual Reasoning Task
by: Deng, Jiaqi, et al.
Published: (2025)
Modality-Aware Integration with Large Language Models for Knowledge-based Visual Question Answering
by: Dong, Junnan, et al.
Published: (2024)
Multimedia-Aware Question Answering: A Review of Retrieval and Cross-Modal Reasoning Architectures
by: Raja, Rahul, et al.
Published: (2025)
Structural Anchor Pruning: Training-Free Multi-Vector Compression for Visual Document Retrieval
by: Liu, Zhuchenyang, et al.
Published: (2026)
ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents
by: Wang, Qiuchen, et al.
Published: (2025)
Sculpting the Vector Space: Towards Efficient Multi-Vector Visual Document Retrieval via Prune-then-Merge Framework
by: Yan, Yibo, et al.
Published: (2026)
Personal Visual Memory from Explicit and Implicit Evidence
by: Nguyen, Viet, et al.
Published: (2026)
MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory
by: Guo, Minghao, et al.
Published: (2026)
VISTA: Visualized Text Embedding For Universal Multi-Modal Retrieval
by: Zhou, Junjie, et al.
Published: (2024)
VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents
by: Tanaka, Ryota, et al.
Published: (2025)
TabRAG: Improving Tabular Document Question Answering for Retrieval Augmented Generation via Structured Representations
by: Si, Jacob, et al.
Published: (2025)
Visual Lifelog Retrieval through Captioning-Enhanced Interpretation
by: Shih, Yu-Fei, et al.
Published: (2025)
Multi-Vector Index Compression in Any Modality
by: Qin, Hanxiang, et al.
Published: (2026)
One Pic is All it Takes: Poisoning Visual Document Retrieval Augmented Generation with a Single Image
by: Shereen, Ezzeldin, et al.
Published: (2025)
Indexing Multimodal Language Models for Large-scale Image Retrieval
by: Tharwat, Bahey, et al.
Published: (2026)
A Multi-Granularity Retrieval Framework for Visually-Rich Documents
by: Xu, Mingjun, et al.
Published: (2025)
Learning Visual Composition through Improved Semantic Guidance
by: Stone, Austin, et al.
Published: (2024)
FM2DS: Few-Shot Multimodal Multihop Data Synthesis with Knowledge Distillation for Question Answering
by: Abaskohi, Amirhossein, et al.
Published: (2024)
Are We on the Right Way for Assessing Document Retrieval-Augmented Generation?
by: Shen, Wenxuan, et al.
Published: (2025)
ColPali: Efficient Document Retrieval with Vision Language Models
by: Faysse, Manuel, et al.
Published: (2024)
LITTA: Late-Interaction and Test-Time Alignment for Visually-Grounded Multimodal Retrieval
by: Kim, Seonok
Published: (2026)
Chain of Evidence: Pixel-Level Visual Attribution for Iterative Retrieval-Augmented Generation
by: Liu, Peiyang, et al.
Published: (2026)
TransientTables: Evaluating LLMs' Reasoning on Temporally Evolving Semi-structured Tables
by: Shankarampeta, Abhilash, et al.
Published: (2025)
Reverse Region-to-Entity Annotation for Pixel-Level Visual Entity Linking
by: Xu, Zhengfei, et al.
Published: (2024)
Towards Natural Language-Based Document Image Retrieval: New Dataset and Benchmark
by: Guo, Hao, et al.
Published: (2025)
Enabling Collaborative Parametric Knowledge Calibration for Retrieval-Augmented Vision Question Answering
by: Deng, Jiaqi, et al.
Published: (2025)
Think When Needed: Adaptive Reasoning-Driven Multimodal Embeddings with a Dual-LoRA Architecture
by: Zhang, Longxiang, et al.
Published: (2026)
Improving Applicability of Deep Learning based Token Classification models during Training
by: Mehra, Anket, et al.
Published: (2025)
Smart Multi-Modal Search: Contextual Sparse and Dense Embedding Integration in Adobe Express
by: Aroraa, Cherag, et al.
Published: (2024)
Rethinking Detection Based Table Structure Recognition for Visually Rich Document Images
by: Xiao, Bin, et al.
Published: (2023)
ViDR: Grounding Multimodal Deep Research Reports in Source Visual Evidence
by: Shi, Zhuofan, et al.
Published: (2026)
VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents
by: Yu, Shi, et al.
Published: (2024)
DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories
by: Deng, Chenlong, et al.
Published: (2026)
From Videos to Indexed Knowledge Graphs -- Framework to Marry Methods for Multimodal Content Analysis and Understanding
by: Rizk, Basem, et al.
Published: (2025)