Staff View: Enhancing Spoken Discourse Modeling in Language Models Using Gestural Cues :: Library Catalog

Bibliographic Details
Main Authors:	Suresh, Varsha; Mughal, M. Hamza; Theobalt, Christian; Demberg, Vera
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2503.03474

_version_	1866913720070832128
author	Suresh, Varsha; Mughal, M. Hamza; Theobalt, Christian; Demberg, Vera
author_facet	Suresh, Varsha; Mughal, M. Hamza; Theobalt, Christian; Demberg, Vera
contents	Research in linguistics shows that non-verbal cues, such as gestures, play a crucial role in spoken discourse. For example, speakers perform hand gestures to indicate topic shifts, helping listeners identify transitions in discourse. In this work, we investigate whether the joint modeling of gestures using human motion sequences and language can improve spoken discourse modeling in language models. To integrate gestures into language models, we first encode 3D human motion sequences into discrete gesture tokens using a VQ-VAE. These gesture token embeddings are then aligned with text embeddings through feature alignment, mapping them into the text embedding space. To evaluate the gesture-aligned language model on spoken discourse, we construct text infilling tasks targeting three key discourse cues grounded in linguistic research: discourse connectives, stance markers, and quantifiers. Results show that incorporating gestures enhances marker prediction accuracy across the three tasks, highlighting the complementary information that gestures can offer in modeling spoken discourse. We view this work as an initial step toward leveraging non-verbal cues to advance spoken language modeling in language models.
format	Preprint
id	arxiv_https___arxiv_org_abs_2503_03474
institution	arXiv
publishDate	2025
record_format	arxiv
title	Enhancing Spoken Discourse Modeling in Language Models Using Gestural Cues
topic	Computation and Language
url	https://arxiv.org/abs/2503.03474
