Staff View: EmbRACE-3K: Embodied Reasoning and Action in Complex Environments :: Library Catalog

Bibliographic Details
Main Authors:	Lin, Mingxian; Huang, Wei; Li, Yitang; Jiang, Chengjie; Wu, Kui; Zhong, Fangwei; Qian, Shengju; Wang, Xin; Qi, Xiaojuan
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence; Computation and Language
Online Access:	https://arxiv.org/abs/2507.10548

_version_	1866909689065766912
author	Lin, Mingxian; Huang, Wei; Li, Yitang; Jiang, Chengjie; Wu, Kui; Zhong, Fangwei; Qian, Shengju; Wang, Xin; Qi, Xiaojuan
author_facet	Lin, Mingxian; Huang, Wei; Li, Yitang; Jiang, Chengjie; Wu, Kui; Zhong, Fangwei; Qian, Shengju; Wang, Xin; Qi, Xiaojuan
contents	Recent advanced vision-language models (VLMs) have demonstrated strong performance on passive, offline image and video understanding tasks. However, their effectiveness in embodied settings, which require online interaction and active scene understanding, remains limited. In such scenarios, an agent perceives the environment from a first-person perspective, with each action dynamically shaping subsequent observations. Even state-of-the-art models such as GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro struggle in open-environment interactions, exhibiting clear limitations in spatial reasoning and long-horizon planning. To address this gap, we introduce EmbRACE-3K, a dataset of over 3,000 language-guided tasks situated in diverse, photorealistic environments constructed using Unreal Engine and the UnrealCV-Zoo framework. The tasks encompass a wide range of embodied challenges, including navigation, object manipulation, and multi-stage goal execution. Each task unfolds as a multi-step trajectory, pairing first-person visual observations with high-level instructions, grounded actions, and natural language rationales that express the agent's intent at every step. Using EmbRACE-3K, we establish a benchmark to evaluate the embodied reasoning capabilities of VLMs across three key dimensions: Exploration, Dynamic Spatial-Semantic Reasoning, and Multi-stage Goal Execution. In zero-shot settings, all models achieve success rates below 20%, underscoring the challenge posed by our benchmark and the current limitations of VLMs in interactive environments. To demonstrate the utility of EmbRACE-3K, we further fine-tune Qwen2.5-VL-7B using supervised learning followed by reinforcement learning. This approach yields substantial improvements across all three challenge categories, highlighting the dataset's effectiveness in enabling the development of embodied reasoning capabilities.
format	Preprint
id	arxiv_https___arxiv_org_abs_2507_10548
institution	arXiv
publishDate	2025
record_format	arxiv
title	EmbRACE-3K: Embodied Reasoning and Action in Complex Environments
topic	Computer Vision and Pattern Recognition; Artificial Intelligence; Computation and Language
url	https://arxiv.org/abs/2507.10548

Similar Items