Staff View: VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents :: Library Catalog

Bibliographic Details
Main Authors:	Wang, Zirui; Zhang, Junyi; Ge, Jiaxin; Lian, Long; Fu, Letian; Dunlap, Lisa; Goldberg, Ken; Wang, XuDong; Stoica, Ion; Chan, David M.; Min, Sewon; Gonzalez, Joseph E.
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2601.16973

_version_	1866917219732029440
author	Wang, Zirui; Zhang, Junyi; Ge, Jiaxin; Lian, Long; Fu, Letian; Dunlap, Lisa; Goldberg, Ken; Wang, XuDong; Stoica, Ion; Chan, David M.; Min, Sewon; Gonzalez, Joseph E.
author_facet	Wang, Zirui; Zhang, Junyi; Ge, Jiaxin; Lian, Long; Fu, Letian; Dunlap, Lisa; Goldberg, Ken; Wang, XuDong; Stoica, Ion; Chan, David M.; Min, Sewon; Gonzalez, Joseph E.
contents	Modern Vision-Language Models (VLMs) remain poorly characterized in multi-step visual interactions, particularly in how they integrate perception, memory, and action over long horizons. We introduce VisGym, a gymnasium of 17 environments for evaluating and training VLMs. The suite spans symbolic puzzles, real-image understanding, navigation, and manipulation, and provides flexible controls over difficulty, input representation, planning horizon, and feedback. We also provide multi-step solvers that generate structured demonstrations, enabling supervised finetuning. Our evaluations show that all frontier models struggle in interactive settings, achieving low success rates in both the easy (46.6%) and hard (26.0%) configurations. Our experiments reveal notable limitations: models struggle to effectively leverage long context, performing worse with an unbounded history than with truncated windows. Furthermore, we find that several text-based symbolic tasks become substantially harder once rendered visually. However, explicit goal observations, textual feedback, and exploratory demonstrations in partially observable or unknown-dynamics settings for supervised finetuning yield consistent gains, highlighting concrete failure modes and pathways for improving multi-step visual decision-making. Code, data, and models can be found at: https://visgym.github.io/.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_16973
institution	arXiv
publishDate	2026
record_format	arxiv
title	VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2601.16973

Similar Items