Staff View: :: Library Catalog

Bibliographic Details
Main Authors:	Chen, Zeyuan; Xu, Hongyi; Song, Guoxian; Xie, You; Zhang, Chenxu; Chen, Xin; Wang, Chao; Chang, Di; Luo, Linjie
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2502.17414

_version_	1866909684892434432
author	Chen, Zeyuan; Xu, Hongyi; Song, Guoxian; Xie, You; Zhang, Chenxu; Chen, Xin; Wang, Chao; Chang, Di; Luo, Linjie
author_facet	Chen, Zeyuan; Xu, Hongyi; Song, Guoxian; Xie, You; Zhang, Chenxu; Chen, Xin; Wang, Chao; Chang, Di; Luo, Linjie
contents	We present X-Dancer, a novel zero-shot music-driven image animation pipeline that creates diverse and long-range lifelike human dance videos from a single static image. At its core, we introduce a unified transformer-diffusion framework, featuring an autoregressive transformer model that synthesizes extended and music-synchronized token sequences for 2D body, head, and hand poses, which then guide a diffusion model to produce coherent and realistic dance video frames. Unlike traditional methods that primarily generate human motion in 3D, X-Dancer addresses data limitations and enhances scalability by modeling a wide spectrum of 2D dance motions, capturing their nuanced alignment with musical beats through readily available monocular videos. To achieve this, we first build a spatially compositional token representation from 2D human pose labels associated with keypoint confidences, encoding both large articulated body movements (e.g., upper and lower body) and fine-grained motions (e.g., head and hands). We then design a music-to-motion transformer model that autoregressively generates music-aligned dance pose token sequences, incorporating global attention to both musical style and prior motion context. Finally, we leverage a diffusion backbone to animate the reference image with these synthesized pose tokens through AdaIN, forming a fully differentiable end-to-end framework. Experimental results demonstrate that X-Dancer is able to produce both diverse and characterized dance videos, substantially outperforming state-of-the-art methods in terms of diversity, expressiveness, and realism. Code and model will be available for research purposes.
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_17414
institution	arXiv
publishDate	2025
record_format	arxiv
title	X-Dancer: Expressive Music to Human Dance Video Generation
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2502.17414

Similar Items