Bibliographic Details
Main Authors: Gonzalez, Antonio Galiza Cerdeira, Gajewski, Paweł, Indurkhya, Bipin
Format: Preprint
Published: 2024
Subjects: Robotics, Artificial Intelligence
Online Access: https://arxiv.org/abs/2410.06355
author Gonzalez, Antonio Galiza Cerdeira
Gajewski, Paweł
Indurkhya, Bipin
contents This paper presents UNCOM, a novel hybrid framework for interpreting natural human commands in tabletop scenarios. The system integrates multiple sources of information (speech, gestures, and scene context) to extract structured, actionable instructions for robots. Addressing the need for general-purpose human-robot interaction in domestic environments, UNCOM is designed for zero-shot operation, without reliance on predefined object models or task-specific training data. Using foundational and task-specific deep learning models, it provides out-of-the-box speech recognition, natural language understanding, gesture detection, and object segmentation. The modular architecture enhances transparency and explainability by explicitly parsing commands into object-action-target representations, which enables integration with symbolic robotic frameworks. We demonstrate the system on a TIAGo++ robot and evaluate it on a real-world data set of human-robot interaction scenarios, achieving an 82.39% success rate on our benchmark data set, which highlights the robustness of the system to diversity, noise, and communication ambiguity. The data set, evaluation scenarios, and code are publicly available to support future research.
format Preprint
id arxiv_https___arxiv_org_abs_2410_06355
institution arXiv
publishDate 2024
record_format arxiv
title UNCOM: Zero-shot Context-Aware Command Understanding for Tabletop Scenarios
topic Robotics
Artificial Intelligence
url https://arxiv.org/abs/2410.06355