Bibliographic Details
Main Authors: Lu, Yiyang; Shin, Woong; Karimi, Ahmad Maroof; Wang, Feiyi; Ren, Jie; Smirni, Evgenia
Format: Preprint
Published: 2026
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2604.21134
_version_ 1866913056527745024
author Lu, Yiyang
Shin, Woong
Karimi, Ahmad Maroof
Wang, Feiyi
Ren, Jie
Smirni, Evgenia
contents Vision-Language Models (VLMs) frequently misread values, hallucinate details, and confuse overlapping elements in charts. Current approaches rely solely on pixel interpretation, creating a Pixel-Only Bottleneck: agents treat interactive charts as static images, losing access to the structured specification that encodes exact values. We introduce Introspective and Interactive Visual Grounding (IVG), a framework that combines (1) spec-grounded introspection, which queries the underlying specification for deterministic evidence, with (2) view-grounded interaction, which manipulates the view to resolve visual ambiguity. To enable evaluation without VLM bias, we present iPlotBench, a benchmark of 500 interactive Plotly figures with 6,706 binary questions and ground-truth specifications. Experiments show that introspection improves data reconstruction fidelity, while combining it with interaction achieves the highest QA accuracy (0.81), with a +6.7% gain on overlapping geometries. We further demonstrate IVG in deployed agents that explore data autonomously and collaborate with human users in real time.
format Preprint
id arxiv_https___arxiv_org_abs_2604_21134
institution arXiv
publishDate 2026
record_format arxiv
title Beyond Pixels: Introspective and Interactive Grounding for Visualization Agents
topic Computation and Language
url https://arxiv.org/abs/2604.21134