| Main Authors: | Pei, Rongcan, Li, Huan, Guo, Fang, Zhu, Qi |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2602.10146 |
Similar Items
LongVILA: Scaling Long-Context Visual Language Models for Long Videos
by: Chen, Yukang, et al.
Published: (2024)
Internalized Reasoning for Long-Context Visual Document Understanding
by: Veselka, Austin
Published: (2026)
From Text to Pixel: Advancing Long-Context Understanding in MLLMs
by: Lu, Yujie, et al.
Published: (2024)
No Tokens Wasted: Leveraging Long Context in Biomedical Vision-Language Models
by: Sun, Min Woo, et al.
Published: (2025)
Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding
by: Gao, Sensen, et al.
Published: (2025)
RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding
by: Li, Jiaang, et al.
Published: (2025)
DocLens : A Tool-Augmented Multi-Agent Framework for Long Visual Document Understanding
by: Zhu, Dawei, et al.
Published: (2025)
EviMem: Evidence-Gap-Driven Iterative Retrieval for Long-Term Conversational Memory
by: Li, Yuyang, et al.
Published: (2026)
OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding
by: Zhao, Tiancheng, et al.
Published: (2024)
Understanding Retrieval Robustness for Retrieval-Augmented Image Captioning
by: Li, Wenyan, et al.
Published: (2024)
VimRAG: Navigating Massive Visual Context in Retrieval-Augmented Generation via Multimodal Memory Graph
by: Wang, Qiuchen, et al.
Published: (2026)
MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations
by: Ma, Yubo, et al.
Published: (2024)
Retrieving Counterfactuals Improves Visual In-Context Learning
by: Xiong, Guangzhi, et al.
Published: (2026)
Support or Refute: Analyzing the Stance of Evidence to Detect Out-of-Context Mis- and Disinformation
by: Yuan, Xin, et al.
Published: (2023)
Rethinking Visual Dependency in Long-Context Reasoning for Large Vision-Language Models
by: Zhou, Yucheng, et al.
Published: (2024)
Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence
by: He, Jinghan, et al.
Published: (2024)
Can Retrieval Heads See Images? Multimodal Retrieval Heads in Long-Context Vision-Language Models
by: Li, Aaron Branson Cigres, et al.
Published: (2026)
VisionGraph: Leveraging Large Multimodal Models for Graph Theory Problems in Visual Context
by: Li, Yunxin, et al.
Published: (2024)
VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression?
by: Zhao, Hongbo, et al.
Published: (2025)
ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding
by: Ma, David, et al.
Published: (2025)
VisRAG 2.0: Evidence-Guided Multi-Image Reasoning in Visual Retrieval-Augmented Generation
by: Sun, Yubo, et al.
Published: (2025)
GIRAFFE: Design Choices for Extending the Context Length of Visual Language Models
by: Li, Mukai, et al.
Published: (2024)
EmoGist: Efficient In-Context Learning for Visual Emotion Understanding
by: Seoh, Ronald, et al.
Published: (2025)
Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding
by: Wang, Ziyang, et al.
Published: (2025)
How to Train Your Long-Context Visual Document Model
by: Veselka, Austin
Published: (2026)
DynTok: Dynamic Compression of Visual Tokens for Efficient and Effective Video Understanding
by: Zhang, Hongzhi, et al.
Published: (2025)
Efficient End-to-End Visual Document Understanding with Rationale Distillation
by: Zhu, Wang, et al.
Published: (2023)
Visual In-Context Learning for Large Vision-Language Models
by: Zhou, Yucheng, et al.
Published: (2024)
LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference
by: Wan, Zhongwei, et al.
Published: (2024)
Mitigating GenAI-powered Evidence Pollution for Out-of-Context Multimodal Misinformation Detection
by: Yan, Zehong, et al.
Published: (2025)
XL-HeadTags: Leveraging Multimodal Retrieval Augmentation for the Multilingual Generation of News Headlines and Tags
by: Shohan, Faisal Tareque, et al.
Published: (2024)
Hierarchical Multimodal Pre-training for Visually Rich Webpage Understanding
by: Xu, Hongshen, et al.
Published: (2024)
Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding
by: Wang, Zhaokai, et al.
Published: (2025)
Unveil: Unified Visual-Textual Integration and Distillation for Multi-modal Document Retrieval
by: Sun, Hao, et al.
Published: (2026)
Seeing is Believing: Rich-Context Hallucination Detection for MLLMs via Backward Visual Grounding
by: Guo, Pinxue, et al.
Published: (2025)
Visual-RAG: Benchmarking Text-to-Image Retrieval Augmented Generation for Visual Knowledge Intensive Queries
by: Wu, Yin, et al.
Published: (2025)
ReTaKe: Reducing Temporal and Knowledge Redundancy for Long Video Understanding
by: Wang, Xiao, et al.
Published: (2024)
DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies
by: Song, Wei, et al.
Published: (2025)
Bi-VLDoc: Bidirectional Vision-Language Modeling for Visually-Rich Document Understanding
by: Luo, Chuwei, et al.
Published: (2022)
Context-aware Visual Storytelling with Visual Prefix Tuning and Contrastive Learning
by: Song, Yingjin, et al.
Published: (2024)