Vista Equipo: :: Library Catalog

Guardado en:

Detalles Bibliográficos
Autores principales:	Zhang, Xiao, Li, Dongyuan, Xiang, Liuyu, Zhang, Yao, Zhong, Cheng, He, Zhaofeng
Formato:	Preprint
Publicado:	2025
Materias:	Computation and Language
Acceso en línea:	https://arxiv.org/abs/2509.04457
Etiquetas:	Agregar Etiqueta Sin Etiquetas, Sea el primero en etiquetar este registro!

_version_	1866917146181763072
author	Zhang, Xiao Li, Dongyuan Xiang, Liuyu Zhang, Yao Zhong, Cheng He, Zhaofeng
author_facet	Zhang, Xiao Li, Dongyuan Xiang, Liuyu Zhang, Yao Zhong, Cheng He, Zhaofeng
contents	Although Multimodal Large Language Models (MLLMs) have demonstrated increasingly impressive performance in chart understanding, most of them exhibit alarming hallucinations and significant performance degradation when handling non-annotated charts. We argue that current MLLMs rely largely on visual recognition rather than visual reasoning to interpret the charts, and visual estimation of numerical values is one of the most fundamental capabilities in chart understanding that require complex visual reasoning. To prove this, we introduce ChartVRBench, a benchmark meticulously designed to isolate and evaluate visual reasoning ability in chart understanding. Furthermore, we propose ChartVR-3B/7B trained with a novel Visual Reasoning Reinforcement Finetuning (VR-RFT) strategy to strengthen genuine chart visual reasoning abilities. Extensive experiments show that ChartVR achieves superior performance on ChartVRBench, outperforming even powerful proprietary models. Moreover, the visual reasoning skills cultivated by the proposed VR-RFT demonstrate strong generalization, leading to significant performance gains across a diverse suite of public chart understanding benchmarks. The code and dataset will be publicly available upon publication.
format	Preprint
id	arxiv_https___arxiv_org_abs_2509_04457
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	Do MLLMs Really Understand the Charts? Zhang, Xiao Li, Dongyuan Xiang, Liuyu Zhang, Yao Zhong, Cheng He, Zhaofeng Computation and Language Although Multimodal Large Language Models (MLLMs) have demonstrated increasingly impressive performance in chart understanding, most of them exhibit alarming hallucinations and significant performance degradation when handling non-annotated charts. We argue that current MLLMs rely largely on visual recognition rather than visual reasoning to interpret the charts, and visual estimation of numerical values is one of the most fundamental capabilities in chart understanding that require complex visual reasoning. To prove this, we introduce ChartVRBench, a benchmark meticulously designed to isolate and evaluate visual reasoning ability in chart understanding. Furthermore, we propose ChartVR-3B/7B trained with a novel Visual Reasoning Reinforcement Finetuning (VR-RFT) strategy to strengthen genuine chart visual reasoning abilities. Extensive experiments show that ChartVR achieves superior performance on ChartVRBench, outperforming even powerful proprietary models. Moreover, the visual reasoning skills cultivated by the proposed VR-RFT demonstrate strong generalization, leading to significant performance gains across a diverse suite of public chart understanding benchmarks. The code and dataset will be publicly available upon publication.
title	Do MLLMs Really Understand the Charts?
topic	Computation and Language
url	https://arxiv.org/abs/2509.04457

Ejemplares similares