Bibliographic Details
Main Authors: Xu, Jiaqi, Huang, Tao, Zhang, Kai
Format: Preprint
Published: 2026
Online Access: https://arxiv.org/abs/2602.00611
contents Embodied AI requires agents to understand goals, plan actions, and execute tasks in simulated environments. We present a comprehensive evaluation of Large Language Models (LLMs) on the VirtualHome benchmark using the Embodied Agent Interface (EAI) framework. We compare two representative 7B-parameter models, OPENPANGU-7B and QWEN2.5-7B, across four fundamental tasks: Goal Interpretation, Action Sequencing, Subgoal Decomposition, and Transition Modeling. We propose Structured Self-Consistency (SSC), an enhanced decoding strategy that samples multiple outputs and applies domain-specific voting mechanisms to improve output quality for structured generation tasks. Experimental results demonstrate that SSC significantly enhances performance, with OPENPANGU-7B excelling at hierarchical planning while QWEN2.5-7B shows advantages in action-level tasks. Our analysis reveals complementary strengths across model types, providing insights for future embodied AI system development.
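The abstract describes SSC only as multiple sampling followed by domain-specific voting, so the following is a minimal sketch of that general idea and not the authors' implementation. The function names (`vote_sequence`, `vote_goal_set`), the plan and goal encodings, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
from collections import Counter
from typing import Callable, Hashable, List, Sequence, Set, TypeVar

T = TypeVar("T")


def vote_sequence(samples: Sequence[T], canonicalize: Callable[[T], Hashable]) -> T:
    """Return the first sample whose canonical form is the most frequent.

    Hypothetical voting rule for sequence-like outputs (e.g. action plans):
    samples that differ only superficially share one canonical key.
    """
    if not samples:
        raise ValueError("need at least one sample")
    keys = [canonicalize(s) for s in samples]
    best_key, _ = Counter(keys).most_common(1)[0]
    return samples[keys.index(best_key)]


def vote_goal_set(samples: Sequence[Set[str]], threshold: float = 0.5) -> Set[str]:
    """Keep each goal predicate that appears in more than `threshold` of samples.

    Hypothetical voting rule for set-like outputs (e.g. interpreted goals).
    """
    counts = Counter(pred for s in samples for pred in s)
    needed = threshold * len(samples)
    return {pred for pred, count in counts.items() if count > needed}


def canonical_plan(plan: List[tuple]) -> tuple:
    """Normalize case so equivalent plans compare equal (assumed encoding)."""
    return tuple((action.lower(), obj.lower()) for action, obj in plan)


# Three sampled plans; the first two agree once case is normalized.
plan_samples = [
    [("WALK", "kitchen"), ("GRAB", "cup")],
    [("walk", "kitchen"), ("grab", "cup")],
    [("WALK", "bedroom"), ("GRAB", "cup")],
]
best_plan = vote_sequence(plan_samples, canonical_plan)

# Three sampled goal sets; only one predicate has majority support.
goal_samples = [{"on(tv)", "sit(sofa)"}, {"on(tv)"}, {"on(tv)", "open(door)"}]
agreed_goals = vote_goal_set(goal_samples)
```

Using a different voting rule per output type is one way to read "domain-specific": an action plan is chosen as a whole so it stays executable, while goal predicates can be voted on one by one.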
format Preprint
id arxiv_https___arxiv_org_abs_2602_00611
institution arXiv
publishDate 2026
record_format arxiv
title Structured Self-Consistency: A Multi-Task Evaluation of LLMs on VirtualHome
topic Artificial Intelligence
url https://arxiv.org/abs/2602.00611