Staff View: $G^2$-Reader: Dual Evolving Graphs for Multimodal Document QA :: Library Catalog

Bibliographic Details
Main Authors:	Du, Yaxin; Song, Junru; Zhou, Yifan; Wang, Cheng; Gu, Jiahao; Chen, Zimeng; Chen, Menglan; Yao, Wen; Yang, Yang; Wen, Ying; Chen, Siheng
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2601.22055

_version_	1866914291924336640
author	Du, Yaxin; Song, Junru; Zhou, Yifan; Wang, Cheng; Gu, Jiahao; Chen, Zimeng; Chen, Menglan; Yao, Wen; Yang, Yang; Wen, Ying; Chen, Siheng
author_facet	Du, Yaxin; Song, Junru; Zhou, Yifan; Wang, Cheng; Gu, Jiahao; Chen, Zimeng; Chen, Menglan; Yao, Wen; Yang, Yang; Wen, Ying; Chen, Siheng
contents	Retrieval-augmented generation is a practical paradigm for question answering over long documents, but it remains brittle for multimodal reading where text, tables, and figures are interleaved across many pages. First, flat chunking breaks document-native structure and cross-modal alignment, yielding semantic fragments that are hard to interpret in isolation. Second, even iterative retrieval can fail in long contexts by looping on partial evidence or drifting into irrelevant sections as noise accumulates, since each step is guided only by the current snippet without a persistent global search state. We introduce $G^2$-Reader, a dual-graph system, to address both issues. It evolves a Content Graph to preserve document-native structure and cross-modal semantics, and maintains a Planning Graph, an agentic directed acyclic graph of sub-questions, to track intermediate findings and guide stepwise navigation for evidence completion. On VisDoMBench across five multimodal domains, $G^2$-Reader with Qwen3-VL-32B-Instruct reaches 66.21% average accuracy, outperforming strong baselines and a standalone GPT-5 (53.08%).
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_22055
institution	arXiv
publishDate	2026
record_format	arxiv
title	$G^2$-Reader: Dual Evolving Graphs for Multimodal Document QA
topic	Computation and Language
url	https://arxiv.org/abs/2601.22055

Similar Items