| Main Authors: | Anisetty, Sohan; Hays, James |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | Computer Vision and Pattern Recognition; Graphics |
| Online Access: | https://arxiv.org/abs/2409.01591 |
| _version_ | 1866929483874828288 |
|---|---|
| author | Anisetty, Sohan; Hays, James |
| author_facet | Anisetty, Sohan; Hays, James |
| contents | Our research presents a novel motion generation framework designed to produce whole-body motion sequences conditioned on multiple modalities simultaneously, specifically text and audio inputs. Leveraging Vector Quantized Variational Autoencoders (VQVAEs) for motion discretization and a bidirectional Masked Language Modeling (MLM) strategy for efficient token prediction, our approach achieves improved processing efficiency and coherence in the generated motions. By integrating spatial attention mechanisms and a token critic, we ensure consistency and naturalness in the generated motions. This framework expands the possibilities of motion generation, addressing the limitations of existing approaches and opening avenues for multimodal motion synthesis. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2409_01591 |
| institution | arXiv |
| publishDate | 2024 |
| record_format | arxiv |
| title | Dynamic Motion Synthesis: Masked Audio-Text Conditioned Spatio-Temporal Transformers |
| topic | Computer Vision and Pattern Recognition; Graphics |
| url | https://arxiv.org/abs/2409.01591 |