Staff View: Structured Self-Consistency: A Multi-Task Evaluation of LLMs on VirtualHome :: Library Catalog

Bibliographic Details
Main Authors:	Xu, Jiaqi; Huang, Tao; Zhang, Kai
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2602.00611

_version_	1866914303140954112
author	Xu, Jiaqi; Huang, Tao; Zhang, Kai
author_facet	Xu, Jiaqi; Huang, Tao; Zhang, Kai
contents	Embodied AI requires agents to understand goals, plan actions, and execute tasks in simulated environments. We present a comprehensive evaluation of Large Language Models (LLMs) on the VirtualHome benchmark using the Embodied Agent Interface (EAI) framework. We compare two representative 7B-parameter models, OPENPANGU-7B and QWEN2.5-7B, across four fundamental tasks: Goal Interpretation, Action Sequencing, Subgoal Decomposition, and Transition Modeling. We propose Structured Self-Consistency (SSC), an enhanced decoding strategy that leverages multiple sampling with domain-specific voting mechanisms to improve output quality for structured generation tasks. Experimental results demonstrate that SSC significantly enhances performance, with OPENPANGU-7B excelling at hierarchical planning while QWEN2.5-7B shows advantages in action-level tasks. Our analysis reveals complementary strengths across model types, providing insights for future embodied AI system development.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_00611
institution	arXiv
publishDate	2026
record_format	arxiv
title	Structured Self-Consistency: A Multi-Task Evaluation of LLMs on VirtualHome
topic	Artificial Intelligence
url	https://arxiv.org/abs/2602.00611

Similar Items