Bibliographic Details
Main Authors: Anisetty, Sohan, Hays, James
Format: Preprint
Published: 2024
Online Access: https://arxiv.org/abs/2409.01591
author Anisetty, Sohan
Hays, James
contents Our research presents a novel motion generation framework designed to produce whole-body motion sequences conditioned on multiple modalities simultaneously, specifically text and audio inputs. Leveraging Vector Quantized Variational Autoencoders (VQVAEs) for motion discretization and a bidirectional Masked Language Modeling (MLM) strategy for efficient token prediction, our approach achieves improved processing efficiency and coherence in the generated motions. By integrating spatial attention mechanisms and a token critic, we ensure consistency and naturalness in the generated motions. This framework expands the possibilities of motion generation, addressing the limitations of existing approaches and opening avenues for multimodal motion synthesis.
format Preprint
id arxiv_https___arxiv_org_abs_2409_01591
institution arXiv
publishDate 2024
record_format arxiv
title Dynamic Motion Synthesis: Masked Audio-Text Conditioned Spatio-Temporal Transformers
topic Computer Vision and Pattern Recognition
Graphics
url https://arxiv.org/abs/2409.01591
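
The abstract names two building blocks: discretizing motion with a vector-quantized codebook, and training by masked token prediction. The toy sketch below illustrates only those two ideas in plain Python. It is not the authors' implementation: the codebook values, the `MASK` sentinel, the function names, and the masking ratio are all hypothetical, and a real VQ-VAE learns its codebook while a transformer (not shown) predicts the hidden tokens.

```python
import random

MASK = -1  # hypothetical sentinel id marking a hidden token


def quantize(frames, codebook):
    """Map each continuous frame vector to the index of its nearest codebook entry."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in frames
    ]


def mask_tokens(tokens, ratio, rng):
    """Hide a random subset of tokens; return the masked sequence and hidden positions."""
    n_mask = max(1, round(ratio * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    hidden = set(positions)
    masked = [MASK if i in hidden else t for i, t in enumerate(tokens)]
    return masked, positions


# Toy data (assumed for illustration): a 3-entry codebook and 4 two-dimensional "motion" frames.
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [1.8, 0.1], [1.1, 0.9]]

tokens = quantize(frames, codebook)
masked, positions = mask_tokens(tokens, 0.5, random.Random(0))
```

In a masked-modeling setup, a bidirectional model would receive `masked` (plus any text or audio conditioning) and be trained to recover the original ids at `positions`.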