Staff View: Library Catalog

Bibliographic Details
Main Authors:	Wang, Kun; Li, Yiming; Qu, Mingcheng; Zhang, Aqiang; Yang, Guang; Su, Tonghua
Format:	Preprint
Published:	2026
Subjects:	Robotics
Online Access:	https://arxiv.org/abs/2604.17241

_version_	1866917420104417280
author	Wang, Kun Li, Yiming Qu, Mingcheng Zhang, Aqiang Yang, Guang Su, Tonghua
author_facet	Wang, Kun Li, Yiming Qu, Mingcheng Zhang, Aqiang Yang, Guang Su, Tonghua
contents	Implicit spatial relations and deep semantic structures encoded in object attributes are crucial for procedural planning in embodied AI systems. However, existing approaches often over-rely on the reasoning capabilities of vision-language models (VLMs) themselves, while overlooking the rich structured semantic information that can be mined from multimodal inputs. As a result, models struggle to effectively understand functional spatial relationships in complex scenes. To fully exploit implicit spatial relations and deep semantic structures in multimodal data, we propose GaLa, a vision-language framework for multimodal procedural planning. GaLa introduces a hypergraph-based representation, where object instances in the image are modeled as nodes, and region-level hyperedges are constructed by aggregating objects according to their attributes and functional semantics. This design explicitly captures implicit semantic relations among objects as well as the hierarchical organization of functional regions. Furthermore, we design a TriView HyperGraph Encoder that enforces semantic consistency across the node view, area view, and node-area association view via contrastive learning, enabling hypergraph semantics to be more effectively injected into downstream VLM reasoning. Extensive experiments on the ActPlan1K and ALFRED benchmarks demonstrate that GaLa significantly outperforms existing methods in terms of execution success rate, LCS, and planning correctness.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_17241
institution	arXiv
publishDate	2026
record_format	arxiv
title	GaLa: Hypergraph-Guided Visual Language Models for Procedural Planning
topic	Robotics
url	https://arxiv.org/abs/2604.17241
