Bibliographic Details
Main Authors: Ding, Yikang; Liu, Jiwen; Zhang, Wenyuan; Wang, Zekun; Hu, Wentao; Cui, Liyuan; Lao, Mingming; Shao, Yingchao; Liu, Hui; Li, Xiaohan; Chen, Ming; Liu, Xiaoqiang; Liu, Yu-Shen; Wan, Pengfei
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2509.09595
_version_ 1866909791579799552
author Ding, Yikang
Liu, Jiwen
Zhang, Wenyuan
Wang, Zekun
Hu, Wentao
Cui, Liyuan
Lao, Mingming
Shao, Yingchao
Liu, Hui
Li, Xiaohan
Chen, Ming
Liu, Xiaoqiang
Liu, Yu-Shen
Wan, Pengfei
contents Recent advances in audio-driven avatar video generation have significantly enhanced audio-visual realism. However, existing methods treat instruction conditioning merely as low-level tracking driven by acoustic or visual cues, without modeling the communicative purpose conveyed by the instructions. This limitation compromises their narrative coherence and character expressiveness. To bridge this gap, we introduce Kling-Avatar, a novel cascaded framework that unifies multimodal instruction understanding with photorealistic portrait generation. Our approach adopts a two-stage pipeline. In the first stage, we design a multimodal large language model (MLLM) director that produces a blueprint video conditioned on diverse instruction signals, thereby governing high-level semantics such as character motion and emotions. In the second stage, guided by blueprint keyframes, we generate multiple sub-clips in parallel using a first-last frame strategy. This global-to-local framework preserves fine-grained details while faithfully encoding the high-level intent behind multimodal instructions. Our parallel architecture also enables fast and stable generation of long-duration videos, making it suitable for real-world applications such as digital human livestreaming and vlogging. To comprehensively evaluate our method, we construct a benchmark of 375 curated samples covering diverse instructions and challenging scenarios. Extensive experiments demonstrate that Kling-Avatar is capable of generating vivid, fluent, long-duration videos at up to 1080p and 48 fps, achieving superior performance in lip synchronization accuracy, emotion and dynamic expressiveness, instruction controllability, identity preservation, and cross-domain generalization. These results establish Kling-Avatar as a new benchmark for semantically grounded, high-fidelity audio-driven avatar synthesis.
format Preprint
id arxiv_https___arxiv_org_abs_2509_09595
institution arXiv
publishDate 2025
record_format arxiv
title Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2509.09595