Staff View: ARIG: Autoregressive Interactive Head Generation for Real-time Conversations :: Library Catalog

Bibliographic Details
Main Authors:	Guo, Ying; Liu, Xi; Zhen, Cheng; Yan, Pengfei; Wei, Xiaoming
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2507.00472

_version_	1866912458947428352
author	Guo, Ying; Liu, Xi; Zhen, Cheng; Yan, Pengfei; Wei, Xiaoming
author_facet	Guo, Ying; Liu, Xi; Zhen, Cheng; Yan, Pengfei; Wei, Xiaoming
contents	Face-to-face communication, a common human activity, motivates research on interactive head generation. A virtual agent can generate motion responses with both listening and speaking capabilities based on the audio or motion signals of the other user and itself. However, previous clip-wise generation paradigms and explicit listener/speaker generator-switching methods have limitations in future signal acquisition, contextual behavioral understanding, and switching smoothness, making real-time, realistic generation challenging. In this paper, we propose an autoregressive (AR) frame-wise framework called ARIG to realize real-time generation with better interaction realism. To achieve real-time generation, we model motion prediction as a non-vector-quantized AR process. Unlike discrete codebook-index prediction, we represent the motion distribution using a diffusion procedure, achieving more accurate predictions in continuous space. To improve interaction realism, we emphasize interactive behavior understanding (IBU) and detailed conversational state understanding (CSU). In IBU, based on dual-track dual-modal signals, we summarize short-range behaviors through bidirectional-integrated learning and perform contextual understanding over long ranges. In CSU, we use voice activity signals and the context features of IBU to understand the various states (interruption, feedback, pause, etc.) that exist in actual conversations. These serve as conditions for the final progressive motion prediction. Extensive experiments have verified the effectiveness of our model.
format	Preprint
id	arxiv_https___arxiv_org_abs_2507_00472
institution	arXiv
publishDate	2025
record_format	arxiv
title	ARIG: Autoregressive Interactive Head Generation for Real-time Conversations
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2507.00472

Similar Items