Bibliographic Details
Main Authors: Chen, Jiaben, Wang, Zixin, Zeng, Ailing, Fu, Yang, Yu, Xueyang, Cen, Siyuan, Tanke, Julian, Chen, Yihang, Saito, Koichi, Mitsufuji, Yuki, Gan, Chuang
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2510.07249
author Chen, Jiaben
Wang, Zixin
Zeng, Ailing
Fu, Yang
Yu, Xueyang
Cen, Siyuan
Tanke, Julian
Chen, Yihang
Saito, Koichi
Mitsufuji, Yuki
Gan, Chuang
contents In this work, we present TalkCuts, a large-scale dataset designed to facilitate the study of multi-shot human speech video generation. Unlike existing datasets that focus on single-shot, static viewpoints, TalkCuts offers 164k clips totaling over 500 hours of high-quality human speech videos with diverse camera shots, including close-up, half-body, and full-body views. The dataset includes detailed textual descriptions, 2D keypoints and 3D SMPL-X motion annotations, covering over 10k identities, enabling multimodal learning and evaluation. As a first attempt to showcase the value of the dataset, we present Orator, an LLM-guided multi-modal generation framework as a simple baseline, where the language model functions as a multi-faceted director, orchestrating detailed specifications for camera transitions, speaker gesticulations, and vocal modulation. This architecture enables the synthesis of coherent long-form videos through our integrated multi-modal video generation module. Extensive experiments in both pose-guided and audio-driven settings show that training on TalkCuts significantly enhances the cinematographic coherence and visual appeal of generated multi-shot speech videos. We believe TalkCuts provides a strong foundation for future work in controllable, multi-shot speech video generation and broader multimodal learning.
format Preprint
id arxiv_https___arxiv_org_abs_2510_07249
institution arXiv
publishDate 2025
record_format arxiv
title TalkCuts: A Large-Scale Dataset for Multi-Shot Human Speech Video Generation
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2510.07249