Bibliographic Details
Main Authors: Lee, Jeongah, Sarvghad, Ali
Format: Preprint
Published: 2025
Subjects: Human-Computer Interaction
Online Access: https://arxiv.org/abs/2511.03478
author Lee, Jeongah
Sarvghad, Ali
author_facet Lee, Jeongah
Sarvghad, Ali
contents Large multimodal models (LMMs) are increasingly capable of interpreting visualizations, yet they continue to struggle with spatial reasoning. One proposed strategy is decomposition, which breaks down complex visualizations into structured components. In this work, we examine the efficacy of scalable vector graphics (SVGs) as a decomposition strategy for improving LMMs' performance on floor plan comprehension. Floor plans serve as a valuable testbed because they combine geometry, topology, and semantics, and their reliable comprehension has real-world applications, such as accessibility for blind and low-vision individuals. We conducted an exploratory study with three LMMs (GPT-4o, Claude 3.7 Sonnet, and Llama 3.2 11B Vision Instruct) across 75 floor plans. Results show that combining SVG with raster input (SVG+PNG) improves performance on spatial understanding tasks but often hinders spatial reasoning, particularly in pathfinding. These findings highlight both the promise and limitations of decomposition as a strategy for advancing spatial visualization comprehension.
format Preprint
id arxiv_https___arxiv_org_abs_2511_03478
institution arXiv
publishDate 2025
record_format arxiv
spellingShingle SVG Decomposition for Enhancing Large Multimodal Models Visualization Comprehension: A Study with Floor Plans
Lee, Jeongah
Sarvghad, Ali
Human-Computer Interaction
title SVG Decomposition for Enhancing Large Multimodal Models Visualization Comprehension: A Study with Floor Plans
topic Human-Computer Interaction
url https://arxiv.org/abs/2511.03478