:: Library Catalog

Cover Image

Saved in:

Bibliographic Details
Main Authors:	Pei, Rongcan, Li, Huan, Guo, Fang, Zhu, Qi
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition Computation and Language
Online Access:	https://arxiv.org/abs/2602.10146
Tags:	Add Tag No Tags, Be the first to tag this record!

Similar Items

LongVILA: Scaling Long-Context Visual Language Models for Long Videos
by: Chen, Yukang, et al.
Published: (2024)

Internalized Reasoning for Long-Context Visual Document Understanding
by: Veselka, Austin
Published: (2026)

From Text to Pixel: Advancing Long-Context Understanding in MLLMs
by: Lu, Yujie, et al.
Published: (2024)

No Tokens Wasted: Leveraging Long Context in Biomedical Vision-Language Models
by: Sun, Min Woo, et al.
Published: (2025)

Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding
by: Gao, Sensen, et al.
Published: (2025)

RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding
by: Li, Jiaang, et al.
Published: (2025)

DocLens : A Tool-Augmented Multi-Agent Framework for Long Visual Document Understanding
by: Zhu, Dawei, et al.
Published: (2025)

EviMem: Evidence-Gap-Driven Iterative Retrieval for Long-Term Conversational Memory
by: Li, Yuyang, et al.
Published: (2026)

OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding
by: Zhao, Tiancheng, et al.
Published: (2024)

Understanding Retrieval Robustness for Retrieval-Augmented Image Captioning
by: Li, Wenyan, et al.
Published: (2024)

VimRAG: Navigating Massive Visual Context in Retrieval-Augmented Generation via Multimodal Memory Graph
by: Wang, Qiuchen, et al.
Published: (2026)

MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations
by: Ma, Yubo, et al.
Published: (2024)

Retrieving Counterfactuals Improves Visual In-Context Learning
by: Xiong, Guangzhi, et al.
Published: (2026)

Support or Refute: Analyzing the Stance of Evidence to Detect Out-of-Context Mis- and Disinformation
by: Yuan, Xin, et al.
Published: (2023)

Rethinking Visual Dependency in Long-Context Reasoning for Large Vision-Language Models
by: Zhou, Yucheng, et al.
Published: (2024)

Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence
by: He, Jinghan, et al.
Published: (2024)

Can Retrieval Heads See Images? Multimodal Retrieval Heads in Long-Context Vision-Language Models
by: Li, Aaron Branson Cigres, et al.
Published: (2026)

VisionGraph: Leveraging Large Multimodal Models for Graph Theory Problems in Visual Context
by: Li, Yunxin, et al.
Published: (2024)

VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression?
by: Zhao, Hongbo, et al.
Published: (2025)

ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding
by: Ma, David, et al.
Published: (2025)

VisRAG 2.0: Evidence-Guided Multi-Image Reasoning in Visual Retrieval-Augmented Generation
by: Sun, Yubo, et al.
Published: (2025)

GIRAFFE: Design Choices for Extending the Context Length of Visual Language Models
by: Li, Mukai, et al.
Published: (2024)

EmoGist: Efficient In-Context Learning for Visual Emotion Understanding
by: Seoh, Ronald, et al.
Published: (2025)

Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding
by: Wang, Ziyang, et al.
Published: (2025)

How to Train Your Long-Context Visual Document Model
by: Veselka, Austin
Published: (2026)

DynTok: Dynamic Compression of Visual Tokens for Efficient and Effective Video Understanding
by: Zhang, Hongzhi, et al.
Published: (2025)

Efficient End-to-End Visual Document Understanding with Rationale Distillation
by: Zhu, Wang, et al.
Published: (2023)

Visual In-Context Learning for Large Vision-Language Models
by: Zhou, Yucheng, et al.
Published: (2024)

LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference
by: Wan, Zhongwei, et al.
Published: (2024)

Mitigating GenAI-powered Evidence Pollution for Out-of-Context Multimodal Misinformation Detection
by: Yan, Zehong, et al.
Published: (2025)

XL-HeadTags: Leveraging Multimodal Retrieval Augmentation for the Multilingual Generation of News Headlines and Tags
by: Shohan, Faisal Tareque, et al.
Published: (2024)

Hierarchical Multimodal Pre-training for Visually Rich Webpage Understanding
by: Xu, Hongshen, et al.
Published: (2024)

Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding
by: Wang, Zhaokai, et al.
Published: (2025)

Unveil: Unified Visual-Textual Integration and Distillation for Multi-modal Document Retrieval
by: Sun, Hao, et al.
Published: (2026)

Seeing is Believing: Rich-Context Hallucination Detection for MLLMs via Backward Visual Grounding
by: Guo, Pinxue, et al.
Published: (2025)

Visual-RAG: Benchmarking Text-to-Image Retrieval Augmented Generation for Visual Knowledge Intensive Queries
by: Wu, Yin, et al.
Published: (2025)

ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding
by: Wang, Xiao, et al.
Published: (2024)

DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies
by: Song, Wei, et al.
Published: (2025)

Bi-VLDoc: Bidirectional Vision-Language Modeling for Visually-Rich Document Understanding
by: Luo, Chuwei, et al.
Published: (2022)

Context-aware Visual Storytelling with Visual Prefix Tuning and Contrastive Learning
by: Song, Yingjin, et al.
Published: (2024)