Bibliographic Details
Main Authors: Suresh, Varsha, Mughal, M. Hamza, Theobalt, Christian, Demberg, Vera
Format: Preprint
Published: 2025
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2503.03474
author Suresh, Varsha
Mughal, M. Hamza
Theobalt, Christian
Demberg, Vera
contents Research in linguistics shows that non-verbal cues, such as gestures, play a crucial role in spoken discourse. For example, speakers perform hand gestures to indicate topic shifts, helping listeners identify transitions in discourse. In this work, we investigate whether the joint modeling of gestures using human motion sequences and language can improve spoken discourse modeling in language models. To integrate gestures into language models, we first encode 3D human motion sequences into discrete gesture tokens using a VQ-VAE. These gesture token embeddings are then aligned with text embeddings through feature alignment, mapping them into the text embedding space. To evaluate the gesture-aligned language model on spoken discourse, we construct text infilling tasks targeting three key discourse cues grounded in linguistic research: discourse connectives, stance markers, and quantifiers. Results show that incorporating gestures enhances marker prediction accuracy across the three tasks, highlighting the complementary information that gestures can offer in modeling spoken discourse. We view this work as an initial step toward leveraging non-verbal cues to advance spoken language modeling in language models.
format Preprint
id arxiv_https___arxiv_org_abs_2503_03474
institution arXiv
publishDate 2025
record_format arxiv
title Enhancing Spoken Discourse Modeling in Language Models Using Gestural Cues
topic Computation and Language
url https://arxiv.org/abs/2503.03474