Enregistré dans:
| Auteurs principaux: | , , |
|---|---|
| Format: | Preprint |
| Publié: |
2025
|
| Sujets: | |
| Accès en ligne: | https://arxiv.org/abs/2508.03050 |
| Tags: |
Ajouter un tag
Pas de tags, Soyez le premier à ajouter un tag!
|
| _version_ | 1866908478761598976 |
|---|---|
| author | Zhu, Zeyu Wu, Weijia Shou, Mike Zheng |
| author_facet | Zhu, Zeyu Wu, Weijia Shou, Mike Zheng |
| contents | Existing studies on talking video generation have predominantly focused on single-person monologues or isolated facial animations, limiting their applicability to realistic multi-human interactions. To bridge this gap, we introduce MIT, a large-scale dataset specifically designed for multi-human talking video generation. To this end, we develop an automatic pipeline that collects and annotates multi-person conversational videos. The resulting dataset comprises 12 hours of high-resolution footage, each featuring two to four speakers, with fine-grained annotations of body poses and speech interactions. It captures natural conversational dynamics in multi-speaker scenario, offering a rich resource for studying interactive visual behaviors. To demonstrate the potential of MIT, we furthur propose CovOG, a baseline model for this novel task. It integrates a Multi-Human Pose Encoder (MPE) to handle varying numbers of speakers by aggregating individual pose embeddings, and an Interactive Audio Driver (IAD) to modulate head dynamics based on speaker-specific audio features. Together, these components showcase the feasibility and challenges of generating realistic multi-human talking videos, establishing MIT as a valuable benchmark for future research. The code is avalibale at: https://github.com/showlab/Multi-human-Talking-Video-Dataset. |
| format | Preprint |
| id |
arxiv_https___arxiv_org_abs_2508_03050 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| spellingShingle | Multi-human Interactive Talking Dataset Zhu, Zeyu Wu, Weijia Shou, Mike Zheng Computer Vision and Pattern Recognition Existing studies on talking video generation have predominantly focused on single-person monologues or isolated facial animations, limiting their applicability to realistic multi-human interactions. To bridge this gap, we introduce MIT, a large-scale dataset specifically designed for multi-human talking video generation. To this end, we develop an automatic pipeline that collects and annotates multi-person conversational videos. The resulting dataset comprises 12 hours of high-resolution footage, each featuring two to four speakers, with fine-grained annotations of body poses and speech interactions. It captures natural conversational dynamics in multi-speaker scenario, offering a rich resource for studying interactive visual behaviors. To demonstrate the potential of MIT, we furthur propose CovOG, a baseline model for this novel task. It integrates a Multi-Human Pose Encoder (MPE) to handle varying numbers of speakers by aggregating individual pose embeddings, and an Interactive Audio Driver (IAD) to modulate head dynamics based on speaker-specific audio features. Together, these components showcase the feasibility and challenges of generating realistic multi-human talking videos, establishing MIT as a valuable benchmark for future research. The code is avalibale at: https://github.com/showlab/Multi-human-Talking-Video-Dataset. |
| title | Multi-human Interactive Talking Dataset |
| topic | Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2508.03050 |