Bibliographic Details
Main Authors:	Lee, Dong Won; Gillet, Sarah; Morency, Louis-Philippe; Breazeal, Cynthia; Park, Hae Won
Format:	Preprint
Published:	2026
Subjects:	Robotics
Online Access:	https://arxiv.org/abs/2602.04157

Abstract:

Situated embodied conversation requires robots to interleave real-time dialogue with active perception: deciding what to look at, when to look, and what to say under tight latency constraints. We present a minimal system recipe that pairs a real-time multimodal language model with a small set of tool interfaces for attention and active perception. We study six home-style scenarios that require frequent attention shifts and increasing perceptual scope. Across four system variants, we evaluate turn-level tool-decision correctness against human annotations and collect subjective ratings of interaction quality. Results indicate that combining real-time multimodal large language models with tool use for active perception is a promising direction for practical situated embodied conversation.
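
As an illustration of the turn-level evaluation mentioned in the abstract, the sketch below shows one plausible way to score tool decisions against human annotations: the fraction of turns on which the system's choice matches the annotated one. The record does not specify the actual metric or the tool set, so the function name `tool_decision_accuracy`, the tool labels `look_at` and `scan_room`, and the use of `None` for "no tool call" are all assumptions made for this example.

```python
from typing import Optional, Sequence


def tool_decision_accuracy(
    predicted: Sequence[Optional[str]],
    annotated: Sequence[Optional[str]],
) -> float:
    """Return the fraction of turns where the predicted tool decision
    equals the human-annotated decision (None means "no tool call")."""
    if len(predicted) != len(annotated):
        raise ValueError("predicted and annotated must cover the same turns")
    if not annotated:
        raise ValueError("at least one annotated turn is required")
    # A turn counts as correct only on an exact match of the decision label.
    matches = sum(p == a for p, a in zip(predicted, annotated))
    return matches / len(annotated)


# Hypothetical four-turn interaction; tool names are illustrative only.
predicted = ["look_at", None, "scan_room", None]
annotated = ["look_at", None, "look_at", None]
accuracy = tool_decision_accuracy(predicted, annotated)
```

Exact-match scoring is the simplest choice here; a study could instead score tool arguments or timing separately, which this sketch does not attempt.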
