Saved in:
| Main Authors: | , , |
|---|---|
| Format: | Preprint |
| Published: |
2025
|
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2510.22045 |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
| _version_ | 1866908611551166464 |
|---|---|
| author | Kang, Hyeonsu Bao, Emily Goswami, Anjan |
| author_facet | Kang, Hyeonsu Bao, Emily Goswami, Anjan |
| contents | Vision-language models (VLMs) are increasingly used to evaluate multimodal content, including presentation slides, yet their slide-specific understanding remains underexplored {despite their growing role as critics in agentic, model-forward pipelines}. We introduce VLM-SlideEval, an evaluation framework that probes VLMs along three axes: (1) element-level extraction from slide images aligned to ground truth; (2) robustness to controlled perturbations in geometry, style, and text; and (3) higher-level comprehension, such as recovering a deck's narrative order from shuffled slides. Using publicly available decks from Zenodo (https://huggingface.co/datasets/Forceless/Zenodo10K/viewer/default/pptx), we standardize ground-truth element metadata from PowerPoint XML and live renderings into a unified, verifiable schema. Empirically, VLMs underperform on pixel-accurate extraction and show non-trivial agreement, fidelity, and consistency under controlled perturbations, while performing better on single-slide content understanding; however, they do not reliably capture narrative structure across slides. These results highlight the limits of current VLMs for slide evaluation and motivate calibrated, critic-in-the-loop evaluators that drive iterative refinement and selection in agentic pipelines. |
| format | Preprint |
| id |
arxiv_https___arxiv_org_abs_2510_22045 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| spellingShingle | VLM-SlideEval: Evaluating VLMs on Structured Comprehension and Perturbation Sensitivity in PPT Kang, Hyeonsu Bao, Emily Goswami, Anjan Computer Vision and Pattern Recognition Artificial Intelligence Vision-language models (VLMs) are increasingly used to evaluate multimodal content, including presentation slides, yet their slide-specific understanding remains underexplored {despite their growing role as critics in agentic, model-forward pipelines}. We introduce VLM-SlideEval, an evaluation framework that probes VLMs along three axes: (1) element-level extraction from slide images aligned to ground truth; (2) robustness to controlled perturbations in geometry, style, and text; and (3) higher-level comprehension, such as recovering a deck's narrative order from shuffled slides. Using publicly available decks from Zenodo (https://huggingface.co/datasets/Forceless/Zenodo10K/viewer/default/pptx), we standardize ground-truth element metadata from PowerPoint XML and live renderings into a unified, verifiable schema. Empirically, VLMs underperform on pixel-accurate extraction and show non-trivial agreement, fidelity, and consistency under controlled perturbations, while performing better on single-slide content understanding; however, they do not reliably capture narrative structure across slides. These results highlight the limits of current VLMs for slide evaluation and motivate calibrated, critic-in-the-loop evaluators that drive iterative refinement and selection in agentic pipelines. |
| title | VLM-SlideEval: Evaluating VLMs on Structured Comprehension and Perturbation Sensitivity in PPT |
| topic | Computer Vision and Pattern Recognition Artificial Intelligence |
| url | https://arxiv.org/abs/2510.22045 |