Staff View: Dynamic Motion Synthesis: Masked Audio-Text Conditioned Spatio-Temporal Transformers

Bibliographic Details
Main Authors:	Anisetty, Sohan; Hays, James
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Graphics
Online Access:	https://arxiv.org/abs/2409.01591

_version_	1866929483874828288
author	Anisetty, Sohan; Hays, James
author_facet	Anisetty, Sohan; Hays, James
contents	Our research presents a novel motion generation framework designed to produce whole-body motion sequences conditioned on multiple modalities simultaneously, specifically text and audio inputs. Leveraging Vector Quantized Variational Autoencoders (VQVAEs) for motion discretization and a bidirectional Masked Language Modeling (MLM) strategy for efficient token prediction, our approach achieves improved processing efficiency and coherence in the generated motions. By integrating spatial attention mechanisms and a token critic, we ensure consistency and naturalness in the generated motions. This framework expands the possibilities of motion generation, addressing the limitations of existing approaches and opening avenues for multimodal motion synthesis.
format	Preprint
id	arxiv_https___arxiv_org_abs_2409_01591
institution	arXiv
publishDate	2024
record_format	arxiv
title	Dynamic Motion Synthesis: Masked Audio-Text Conditioned Spatio-Temporal Transformers
topic	Computer Vision and Pattern Recognition; Graphics
url	https://arxiv.org/abs/2409.01591
