Staff View: TransforMerger: Transformer-based Voice-Gesture Fusion for Robust Human-Robot Communication :: Library Catalog

Bibliographic Details
Main Authors:	Vanc, Petr; Stepanova, Karla
Format:	Preprint
Published:	2025
Subjects:	Robotics; Human-Computer Interaction; Machine Learning
Online Access:	https://arxiv.org/abs/2504.01708

_version_	1866915223409000448
author	Vanc, Petr; Stepanova, Karla
author_facet	Vanc, Petr; Stepanova, Karla
contents	As human-robot collaboration advances, natural and flexible communication methods are essential for effective robot control. Traditional methods relying on a single modality or rigid rules struggle with noisy or misaligned data as well as with object descriptions that do not perfectly fit the predefined object names (e.g. 'Pick that red object'). We introduce TransforMerger, a transformer-based reasoning model that infers a structured action command for robotic manipulation based on fused voice and gesture inputs. Our approach merges multimodal data into a single unified sentence, which is then processed by the language model. We employ probabilistic embeddings to handle uncertainty and we integrate contextual scene understanding to resolve ambiguous references (e.g., gestures pointing to multiple objects or vague verbal cues like "this"). We evaluate TransforMerger in simulated and real-world experiments, demonstrating its robustness to noise, misalignment, and missing information. Our results show that TransforMerger outperforms deterministic baselines, especially in scenarios requiring more contextual knowledge, enabling more robust and flexible human-robot communication. Code and datasets are available at: http://imitrob.ciirc.cvut.cz/publications/transformerger.
format	Preprint
id	arxiv_https___arxiv_org_abs_2504_01708
institution	arXiv
publishDate	2025
record_format	arxiv
title	TransforMerger: Transformer-based Voice-Gesture Fusion for Robust Human-Robot Communication
topic	Robotics; Human-Computer Interaction; Machine Learning
url	https://arxiv.org/abs/2504.01708