| Main Authors: | Lu, Yiyang; Shin, Woong; Karimi, Ahmad Maroof; Wang, Feiyi; Ren, Jie; Smirni, Evgenia |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Computation and Language |
| Online Access: | https://arxiv.org/abs/2604.21134 |
| _version_ | 1866913056527745024 |
|---|---|
| author | Lu, Yiyang; Shin, Woong; Karimi, Ahmad Maroof; Wang, Feiyi; Ren, Jie; Smirni, Evgenia |
| contents | Vision-Language Models (VLMs) frequently misread values, hallucinate details, and confuse overlapping elements in charts. Current approaches rely solely on pixel interpretation, creating a Pixel-Only Bottleneck: agents treat interactive charts as static images and lose access to the structured specification that encodes exact values. We introduce Introspective and Interactive Visual Grounding (IVG), a framework that combines (1) spec-grounded introspection, which queries the underlying specification for deterministic evidence, with (2) view-grounded interaction, which manipulates the view to resolve visual ambiguity. To enable evaluation without VLM bias, we present iPlotBench, a benchmark of 500 interactive Plotly figures with 6,706 binary questions and ground-truth specifications. Experiments show that introspection improves data reconstruction fidelity, while combining it with interaction achieves the highest QA accuracy (0.81), with a +6.7% gain on overlapping geometries. We further demonstrate IVG in deployed agents that explore data autonomously and collaborate with human users in real time. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2604_21134 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | Beyond Pixels: Introspective and Interactive Grounding for Visualization Agents |
| topic | Computation and Language |
| url | https://arxiv.org/abs/2604.21134 |