Staff View: Efficient Table Retrieval and Understanding with Multimodal Large Language Models :: Library Catalog

Bibliographic Details
Main Authors:	Xu, Zhuoyan; Fang, Haoyang; Han, Boran; Min, Bonan; Wang, Bernie; Hu, Cuixiong; Zhang, Shuai
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence; Machine Learning
Online Access:	https://arxiv.org/abs/2602.07642

_version_	1866911431315685376
author	Xu, Zhuoyan; Fang, Haoyang; Han, Boran; Min, Bonan; Wang, Bernie; Hu, Cuixiong; Zhang, Shuai
author_facet	Xu, Zhuoyan; Fang, Haoyang; Han, Boran; Min, Bonan; Wang, Bernie; Hu, Cuixiong; Zhang, Shuai
contents	Tabular data is frequently captured in image form across a wide range of real-world scenarios such as financial reports, handwritten records, and document scans. These visual representations pose unique challenges for machine understanding, as they combine both structural and visual complexities. While recent advances in Multimodal Large Language Models (MLLMs) show promising results in table understanding, they typically assume the relevant table is readily available. However, a more practical scenario involves identifying and reasoning over relevant tables from large-scale collections to answer user queries. To address this gap, we propose TabRAG, a framework that enables MLLMs to answer queries over large collections of table images. Our approach first retrieves candidate tables using jointly trained visual-text foundation models, then leverages MLLMs to perform fine-grained reranking of these candidates, and finally employs MLLMs to reason over the selected tables for answer generation. Through extensive experiments on a newly constructed dataset comprising 88,161 training and 9,819 testing samples across 8 benchmarks with 48,504 unique tables, we demonstrate that our framework significantly outperforms existing methods by 7.0% in retrieval recall and 6.1% in answer accuracy, offering a practical solution for real-world table understanding tasks.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_07642
institution	arXiv
publishDate	2026
record_format	arxiv
title	Efficient Table Retrieval and Understanding with Multimodal Large Language Models
topic	Artificial Intelligence; Machine Learning
url	https://arxiv.org/abs/2602.07642
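
The abstract in this record describes a three-stage flow: retrieve candidate tables, rerank them, then reason over the selected tables to generate an answer. The following is a minimal sketch of that control flow only, not the authors' implementation. Everything named here is a hypothetical stand-in: `Table`, `retrieve`, `rerank`, and `answer_query` are illustrative names, the embeddings are assumed to be precomputed by some visual-text encoder, and the `score` and `generate` callables stand in for the MLLM reranking and answering calls.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Callable, List, Sequence

Vector = Sequence[float]


@dataclass(frozen=True)
class Table:
    """Hypothetical record for one table image."""
    table_id: str
    embedding: Vector  # assumed precomputed by a visual-text encoder


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: Vector, tables: Sequence[Table], k: int) -> List[Table]:
    """Stage 1: coarse retrieval of the k nearest tables by embedding similarity."""
    ranked = sorted(tables, key=lambda t: cosine(query_vec, t.embedding), reverse=True)
    return ranked[:k]


def rerank(query: str, candidates: Sequence[Table],
           score: Callable[[str, Table], float], m: int) -> List[Table]:
    """Stage 2: fine-grained reranking; `score` stands in for an MLLM relevance judgment."""
    ranked = sorted(candidates, key=lambda t: score(query, t), reverse=True)
    return ranked[:m]


def answer_query(query: str, query_vec: Vector, tables: Sequence[Table],
                 score: Callable[[str, Table], float],
                 generate: Callable[[str, List[Table]], str],
                 k: int = 3, m: int = 1) -> str:
    """Stage 3: pass the selected tables to `generate`, a stand-in for MLLM answering."""
    candidates = retrieve(query_vec, tables, k)
    selected = rerank(query, candidates, score, m)
    return generate(query, selected)


# Toy data: two-dimensional embeddings chosen by hand for illustration.
tables = [
    Table("revenue", (1.0, 0.0)),
    Table("staff", (0.0, 1.0)),
    Table("costs", (0.8, 0.6)),
]

# Stand-in scorer: favors a table whose id appears in the query text.
score = lambda q, t: 1.0 if t.table_id in q else 0.0
# Stand-in generator: reports which table it would reason over.
generate = lambda q, ts: f"answer from {ts[0].table_id}"

result = answer_query("total costs in 2024", (1.0, 0.1), tables, score, generate, k=2, m=1)
print(result)
```

In this toy run the coarse stage ranks `revenue` above `costs`, and the reranking stage promotes `costs`, which illustrates why a second, finer-grained pass can change the final selection.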