Bibliographic Details
Main Authors: Vanc, Petr, Stepanova, Karla
Format: Preprint
Published: 2025
Online Access: https://arxiv.org/abs/2504.01708
author Vanc, Petr
Stepanova, Karla
contents As human-robot collaboration advances, natural and flexible communication methods are essential for effective robot control. Traditional methods that rely on a single modality or rigid rules struggle with noisy or misaligned data, as well as with object descriptions that do not exactly match the predefined object names (e.g., "Pick that red object"). We introduce TransforMerger, a transformer-based reasoning model that infers a structured action command for robotic manipulation from fused voice and gesture inputs. Our approach merges multimodal data into a single unified sentence, which is then processed by the language model. We employ probabilistic embeddings to handle uncertainty, and we integrate contextual scene understanding to resolve ambiguous references (e.g., gestures pointing to multiple objects or vague verbal cues like "this"). We evaluate TransforMerger in simulated and real-world experiments, demonstrating its robustness to noise, misalignment, and missing information. Our results show that TransforMerger outperforms deterministic baselines, especially in scenarios requiring more contextual knowledge, enabling more robust and flexible human-robot communication. Code and datasets are available at: http://imitrob.ciirc.cvut.cz/publications/transformerger.
format Preprint
id arxiv_https___arxiv_org_abs_2504_01708
institution arXiv
publishDate 2025
record_format arxiv
title TransforMerger: Transformer-based Voice-Gesture Fusion for Robust Human-Robot Communication
topic Robotics
Human-Computer Interaction
Machine Learning
url https://arxiv.org/abs/2504.01708