Staff View: UniMuMo: Unified Text, Music and Motion Generation

Bibliographic Details
Main Authors:	Yang, Han; Su, Kun; Zhang, Yutong; Chen, Jiaben; Qian, Kaizhi; Liu, Gaowen; Gan, Chuang
Format:	Preprint
Published:	2024
Subjects:	Sound; Computer Vision and Pattern Recognition; Graphics; Machine Learning; Multimedia; Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2410.04534

_version_	1866929529529827328
author	Yang, Han; Su, Kun; Zhang, Yutong; Chen, Jiaben; Qian, Kaizhi; Liu, Gaowen; Gan, Chuang
author_facet	Yang, Han; Su, Kun; Zhang, Yutong; Chen, Jiaben; Qian, Kaizhi; Liu, Gaowen; Gan, Chuang
contents	We introduce UniMuMo, a unified multimodal model capable of taking arbitrary text, music, and motion data as input conditions to generate outputs across all three modalities. To address the lack of time-synchronized data, we align unpaired music and motion data based on rhythmic patterns to leverage existing large-scale music-only and motion-only datasets. By converting music, motion, and text into token-based representations, our model bridges these modalities through a unified encoder-decoder transformer architecture. To support multiple generation tasks within a single framework, we introduce several architectural improvements. We propose encoding motion with a music codebook, mapping motion into the same feature space as music. We introduce a music-motion parallel generation scheme that unifies all music and motion generation tasks into a single transformer decoder architecture with a single training task of music-motion joint generation. Moreover, the model is designed by fine-tuning existing pre-trained single-modality models, significantly reducing computational demands. Extensive experiments demonstrate that UniMuMo achieves competitive results on all unidirectional generation benchmarks across music, motion, and text modalities. Qualitative results are available on the project page (https://hanyangclarence.github.io/unimumo_demo/).
format	Preprint
id	arxiv_https___arxiv_org_abs_2410_04534
institution	arXiv
publishDate	2024
record_format	arxiv
title	UniMuMo: Unified Text, Music and Motion Generation
topic	Sound; Computer Vision and Pattern Recognition; Graphics; Machine Learning; Multimedia; Audio and Speech Processing
url	https://arxiv.org/abs/2410.04534