Bibliographic Details
Main Authors: Zhang, Xiao; Li, Dongyuan; Xiang, Liuyu; Zhang, Yao; Zhong, Cheng; He, Zhaofeng
Format: Preprint
Published: 2025
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2509.04457
_version_ 1866917146181763072
author Zhang, Xiao
Li, Dongyuan
Xiang, Liuyu
Zhang, Yao
Zhong, Cheng
He, Zhaofeng
contents Although Multimodal Large Language Models (MLLMs) have demonstrated increasingly impressive performance in chart understanding, most of them exhibit alarming hallucinations and significant performance degradation when handling non-annotated charts. We argue that current MLLMs rely largely on visual recognition rather than visual reasoning to interpret the charts, and visual estimation of numerical values is one of the most fundamental capabilities in chart understanding that require complex visual reasoning. To prove this, we introduce ChartVRBench, a benchmark meticulously designed to isolate and evaluate visual reasoning ability in chart understanding. Furthermore, we propose ChartVR-3B/7B trained with a novel Visual Reasoning Reinforcement Finetuning (VR-RFT) strategy to strengthen genuine chart visual reasoning abilities. Extensive experiments show that ChartVR achieves superior performance on ChartVRBench, outperforming even powerful proprietary models. Moreover, the visual reasoning skills cultivated by the proposed VR-RFT demonstrate strong generalization, leading to significant performance gains across a diverse suite of public chart understanding benchmarks. The code and dataset will be publicly available upon publication.
format Preprint
id arxiv_https___arxiv_org_abs_2509_04457
institution arXiv
publishDate 2025
record_format arxiv
title Do MLLMs Really Understand the Charts?
topic Computation and Language
url https://arxiv.org/abs/2509.04457