Bibliographic Details
Main Authors:	Song, Sichao; Okafuji, Yuki; Ariu, Kaito; Koike, Amy
Format:	Preprint
Published:	2026
Subjects:	Robotics
Online Access:	https://arxiv.org/abs/2601.01969

_version_	1866908748221513728
author	Song, Sichao; Okafuji, Yuki; Ariu, Kaito; Koike, Amy
author_facet	Song, Sichao; Okafuji, Yuki; Ariu, Kaito; Koike, Amy
contents	Designing policies that are both efficient and acceptable for conversational service robots in open and diverse environments is non-trivial. Unlike fixed, hand-tuned parameters, online learning can adapt to non-stationary conditions. In this paper, we study how to adapt a social robot's speech policy in the wild. During a 12-day in-situ deployment with over 1,400 public encounters, we cast online policy optimization as a multi-armed bandit problem and use Thompson sampling to select among six actions defined by speech rate (slow/normal/fast) and verbosity (concise/detailed). We compare three complementary binary rewards--Ru (user rating), Rc (conversation closure), and Rt (>=2 turns)--and show that each induces distinct arm distributions and interaction behaviors. We complement the online results with offline evaluations that analyze contextual factors (e.g., crowd level, group size) using video-annotated data. Taken together, we distill ready-to-use design lessons for deploying online optimization of speech policies in real public HRI settings.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_01969
institution	arXiv
publishDate	2026
record_format	arxiv
title	What you reward is what you learn: Comparing rewards for online speech policy optimization in public HRI
topic	Robotics
url	https://arxiv.org/abs/2601.01969
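
The abstract in the contents field describes selecting among six speech-policy arms (speech rate x verbosity) with Thompson sampling on a binary reward. The following is a minimal sketch of that general technique, not the authors' implementation. It assumes a Beta-Bernoulli model with uniform Beta(1, 1) priors; the class name, the simulate helper, and the reward probabilities used in the simulation are hypothetical and chosen only for illustration.

```python
import itertools
import random

# The six arms named in the abstract: speech rate x verbosity.
SPEECH_RATES = ("slow", "normal", "fast")
VERBOSITY = ("concise", "detailed")
ARMS = list(itertools.product(SPEECH_RATES, VERBOSITY))


class BetaBernoulliThompson:
    """Thompson sampling for binary rewards (hypothetical helper class)."""

    def __init__(self, arms, seed=None):
        self.arms = list(arms)
        self.rng = random.Random(seed)
        # Assumption: uniform Beta(1, 1) prior on each arm's success rate.
        self.successes = {arm: 1 for arm in self.arms}
        self.failures = {arm: 1 for arm in self.arms}

    def select(self):
        """Draw one sample per arm from its posterior; play the largest."""
        samples = {
            arm: self.rng.betavariate(self.successes[arm], self.failures[arm])
            for arm in self.arms
        }
        return max(samples, key=samples.get)

    def update(self, arm, reward):
        """Fold one binary reward (0 or 1) into the arm's posterior."""
        if reward not in (0, 1):
            raise ValueError("reward must be 0 or 1")
        if reward:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1

    def pulls(self, arm):
        """Number of times the arm was played (prior counts excluded)."""
        return self.successes[arm] + self.failures[arm] - 2


def simulate(true_rates, n_encounters, seed=0):
    """Run the policy against made-up reward probabilities (illustration only)."""
    policy = BetaBernoulliThompson(true_rates.keys(), seed=seed)
    env_rng = random.Random(seed + 1)
    for _ in range(n_encounters):
        arm = policy.select()
        reward = 1 if env_rng.random() < true_rates[arm] else 0
        policy.update(arm, reward)
    return policy
```

In this sketch the choice of reward signal (Ru, Rc, or Rt in the abstract) only changes which 0/1 value is passed to update; the selection step is the same for all three.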

Similar Items