| Main Authors: | Zhang, Xiangyue; Li, Jianfang; Zhang, Jiaxu; Ren, Jianqiang; Bo, Liefeng; Tu, Zhigang |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Graphics; Computer Vision and Pattern Recognition; Sound |
| Online Access: | https://arxiv.org/abs/2504.09209 |
| _version_ | 1866912464289923072 |
|---|---|
| author | Zhang, Xiangyue; Li, Jianfang; Zhang, Jiaxu; Ren, Jianqiang; Bo, Liefeng; Tu, Zhigang |
| author_facet | Zhang, Xiangyue; Li, Jianfang; Zhang, Jiaxu; Ren, Jianqiang; Bo, Liefeng; Tu, Zhigang |
| contents | Masked modeling frameworks have shown promise in co-speech motion generation. However, they struggle to identify semantically significant frames for effective motion masking. In this work, we propose a speech-queried attention-based mask modeling framework for co-speech motion generation. Our key insight is to leverage motion-aligned speech features to guide the masked motion modeling process, selectively masking rhythm-related and semantically expressive motion frames. Specifically, we first propose a motion-audio alignment module (MAM) to construct a latent motion-audio joint space. Both low-level and high-level speech features are projected into this space, enabling motion-aligned speech representation using learnable speech queries. Then, a speech-queried attention mechanism (SQA) is introduced to compute frame-level attention scores through interactions between motion keys and speech queries, guiding selective masking toward motion frames with high attention scores. Finally, the motion-aligned speech features are also injected into the generation network to facilitate co-speech motion generation. Qualitative and quantitative evaluations confirm that our method outperforms existing state-of-the-art approaches, successfully producing high-quality co-speech motion. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2504_09209 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | EchoMask: Speech-Queried Attention-based Mask Modeling for Holistic Co-Speech Motion Generation |
| topic | Graphics; Computer Vision and Pattern Recognition; Sound |
| url | https://arxiv.org/abs/2504.09209 |
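
The abstract in this record describes selecting frames to mask by scoring motion keys against speech queries. The following is a minimal sketch of that idea only, not the authors' implementation: the scaled dot-product scoring, the averaging over queries, the `mask_ratio` parameter, and the function names `frame_attention_scores` and `select_mask` are all assumptions made for illustration, and random vectors stand in for learned features.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def frame_attention_scores(speech_queries, motion_keys):
    """Per-frame score: softmax(q . k / sqrt(d)) over frames, averaged over queries.

    Assumption: scaled dot-product attention; the record does not specify the form.
    """
    d = len(motion_keys[0])
    scores = [0.0] * len(motion_keys)
    for q in speech_queries:
        logits = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in motion_keys
        ]
        for t, w in enumerate(softmax(logits)):
            scores[t] += w / len(speech_queries)
    return scores


def select_mask(scores, mask_ratio):
    """Return the sorted indices of the highest-scoring frames to mask."""
    n_mask = max(1, round(mask_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:n_mask])


# Toy data: random vectors stand in for learned motion keys and speech queries.
rng = random.Random(0)
dim, n_frames, n_queries = 8, 10, 4
motion_keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_frames)]
speech_queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_queries)]

scores = frame_attention_scores(speech_queries, motion_keys)
masked = select_mask(scores, mask_ratio=0.3)
print("frames selected for masking:", masked)
```

Because each query's attention weights sum to one across frames, the averaged scores also sum to one, so they can be read as a distribution over frames from which the top fraction is masked.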