Bibliographic Details
Main Authors: Kling Team; Chen, Jialu; Ding, Yikang; Fang, Zhixue; Gai, Kun; Gao, Yuan; He, Kang; Hua, Jingyun; Jiang, Boyuan; Lao, Mingming; Li, Xiaohan; Liu, Hui; Liu, Jiwen; Liu, Xiaoqiang; Liu, Yuan; Lu, Shun; Mao, Yongsen; Shao, Yingchao; Shi, Huafeng; Shi, Xiaoyu; Sun, Peiqin; Tang, Songlin; Wan, Pengfei; Wang, Chao; Wang, Xuebo; Zhang, Haoxian; Zhang, Yuanxing; Zhou, Yan
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2512.13313
_version_ 1866912765160980480
author Kling Team
Chen, Jialu
Ding, Yikang
Fang, Zhixue
Gai, Kun
Gao, Yuan
He, Kang
Hua, Jingyun
Jiang, Boyuan
Lao, Mingming
Li, Xiaohan
Liu, Hui
Liu, Jiwen
Liu, Xiaoqiang
Liu, Yuan
Lu, Shun
Mao, Yongsen
Shao, Yingchao
Shi, Huafeng
Shi, Xiaoyu
Sun, Peiqin
Tang, Songlin
Wan, Pengfei
Wang, Chao
Wang, Xuebo
Zhang, Haoxian
Zhang, Yuanxing
Zhou, Yan
contents Avatar video generation models have achieved remarkable progress in recent years. However, prior work exhibits limited efficiency in generating long-duration high-resolution videos, suffering from temporal drifting, quality degradation, and weak prompt following as video length increases. To address these challenges, we propose KlingAvatar 2.0, a spatio-temporal cascade framework that performs upscaling in both spatial resolution and temporal dimension. The framework first generates low-resolution blueprint video keyframes that capture global semantics and motion, and then refines them into high-resolution, temporally coherent sub-clips using a first-last frame strategy, while retaining smooth temporal transitions in long-form videos. To enhance cross-modal instruction fusion and alignment in extended videos, we introduce a Co-Reasoning Director composed of three modality-specific large language model (LLM) experts. These experts reason about modality priorities and infer underlying user intent, converting inputs into detailed storylines through multi-turn dialogue. A Negative Director further refines negative prompts to improve instruction alignment. Building on these components, we extend the framework to support ID-specific multi-character control. Extensive experiments demonstrate that our model effectively addresses the challenges of efficient, multimodally aligned long-form high-resolution video generation, delivering enhanced visual clarity, realistic lip-teeth rendering with accurate lip synchronization, strong identity preservation, and coherent multimodal instruction following.
format Preprint
id arxiv_https___arxiv_org_abs_2512_13313
institution arXiv
publishDate 2025
record_format arxiv
title KlingAvatar 2.0 Technical Report
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2512.13313
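
The abstract describes refining low-resolution blueprint keyframes into sub-clips with a first-last frame strategy. The record gives no implementation details, so the following is only a minimal sketch of one way such sub-clip boundaries could be derived: consecutive keyframes are paired so that each sub-clip ends on the keyframe that starts the next one. The function name `subclip_spans` and the keyframe timestamps are hypothetical and are not taken from the report.

```python
from typing import List, Tuple


def subclip_spans(keyframe_times: List[float]) -> List[Tuple[float, float]]:
    """Pair consecutive keyframe timestamps into (first, last) sub-clip spans.

    Hypothetical helper for illustration only; not the authors' code.
    Adjacent spans share a boundary keyframe, which is one simple way to
    keep transitions between sub-clips anchored to the same frame.
    """
    if len(keyframe_times) < 2:
        raise ValueError("need at least two keyframes")
    # Timestamps must increase so that every span has positive duration.
    if any(b <= a for a, b in zip(keyframe_times, keyframe_times[1:])):
        raise ValueError("keyframe times must be strictly increasing")
    return list(zip(keyframe_times, keyframe_times[1:]))


# Assumed example: blueprint keyframes every 5 seconds of a 15-second video.
spans = subclip_spans([0.0, 5.0, 10.0, 15.0])
for first, last in spans:
    print(f"refine sub-clip from keyframe at {first}s to keyframe at {last}s")
```

In this sketch each span could be handed to a separate refinement call, since a sub-clip depends only on its own first and last keyframes. Whether the actual system processes sub-clips independently is not stated in this record.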