_version_ 1866908907029397504
author SII-GAIR
Sand.ai
Chern, Ethan
Teng, Hansi
Sun, Hanwen
Wang, Hao
Pan, Hong
Jia, Hongyu
Su, Jiadi
Li, Jin
Yu, Junjie
Liu, Lijie
Li, Lingzhi
Ye, Lyumanshan
Hu, Min
Wang, Qiangang
Qi, Quanwei
Chern, Steffi
Bu, Tao
Wang, Taoran
Xu, Teren
Zhang, Tianning
Mi, Tiantian
Xu, Weixian
Zhang, Wenqiang
Zhang, Wentai
Yi, Xianping
Cai, Xiaojie
Kang, Xiaoyang
Ma, Yan
Liu, Yixiu
Zhang, Yunbo
Huang, Yunpeng
Lin, Yutong
Tao, Zewei
Liu, Zhaoliang
Zhang, Zheng
Cen, Zhiyao
Yu, Zhixuan
Wang, Zhongshu
Hu, Zhulin
Zhou, Zijin
Guo, Zinan
Cao, Yue
Liu, Pengfei
author_facet SII-GAIR
Sand.ai
Chern, Ethan
Teng, Hansi
Sun, Hanwen
Wang, Hao
Pan, Hong
Jia, Hongyu
Su, Jiadi
Li, Jin
Yu, Junjie
Liu, Lijie
Li, Lingzhi
Ye, Lyumanshan
Hu, Min
Wang, Qiangang
Qi, Quanwei
Chern, Steffi
Bu, Tao
Wang, Taoran
Xu, Teren
Zhang, Tianning
Mi, Tiantian
Xu, Weixian
Zhang, Wenqiang
Zhang, Wentai
Yi, Xianping
Cai, Xiaojie
Kang, Xiaoyang
Ma, Yan
Liu, Yixiu
Zhang, Yunbo
Huang, Yunpeng
Lin, Yutong
Tao, Zewei
Liu, Zhaoliang
Zhang, Zheng
Cen, Zhiyao
Yu, Zhixuan
Wang, Zhongshu
Hu, Zhulin
Zhou, Zijin
Guo, Zinan
Cao, Yue
Liu, Pengfei
contents We present daVinci-MagiHuman, an open-source audio-video generative foundation model for human-centric generation. daVinci-MagiHuman jointly generates synchronized video and audio using a single-stream Transformer that processes text, video, and audio within a unified token sequence via self-attention only. This single-stream design avoids the complexity of multi-stream or cross-attention architectures while remaining easy to optimize with standard training and inference infrastructure. The model is particularly strong in human-centric scenarios, producing expressive facial performance, natural speech-expression coordination, realistic body motion, and precise audio-video synchronization. It supports multilingual spoken generation across Chinese (Mandarin and Cantonese), English, Japanese, Korean, German, and French. For efficient inference, we combine the single-stream backbone with model distillation, latent-space super-resolution, and a Turbo VAE decoder, enabling generation of a 5-second 256p video in 2 seconds on a single H100 GPU. In automatic evaluation, daVinci-MagiHuman achieves the highest visual quality and text alignment among leading open models, along with the lowest word error rate (14.60%) for speech intelligibility. In pairwise human evaluation, it achieves win rates of 80.0% against Ovi 1.1 and 60.9% against LTX 2.3 over 2000 comparisons. We open-source the complete model stack, including the base model, the distilled model, the super-resolution model, and the inference codebase.
format Preprint
id arxiv_https___arxiv_org_abs_2603_21986
institution arXiv
publishDate 2026
record_format arxiv
spellingShingle Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model
SII-GAIR
Sand.ai
Chern, Ethan
Teng, Hansi
Sun, Hanwen
Wang, Hao
Pan, Hong
Jia, Hongyu
Su, Jiadi
Li, Jin
Yu, Junjie
Liu, Lijie
Li, Lingzhi
Ye, Lyumanshan
Hu, Min
Wang, Qiangang
Qi, Quanwei
Chern, Steffi
Bu, Tao
Wang, Taoran
Xu, Teren
Zhang, Tianning
Mi, Tiantian
Xu, Weixian
Zhang, Wenqiang
Zhang, Wentai
Yi, Xianping
Cai, Xiaojie
Kang, Xiaoyang
Ma, Yan
Liu, Yixiu
Zhang, Yunbo
Huang, Yunpeng
Lin, Yutong
Tao, Zewei
Liu, Zhaoliang
Zhang, Zheng
Cen, Zhiyao
Yu, Zhixuan
Wang, Zhongshu
Hu, Zhulin
Zhou, Zijin
Guo, Zinan
Cao, Yue
Liu, Pengfei
Computer Vision and Pattern Recognition
We present daVinci-MagiHuman, an open-source audio-video generative foundation model for human-centric generation. daVinci-MagiHuman jointly generates synchronized video and audio using a single-stream Transformer that processes text, video, and audio within a unified token sequence via self-attention only. This single-stream design avoids the complexity of multi-stream or cross-attention architectures while remaining easy to optimize with standard training and inference infrastructure. The model is particularly strong in human-centric scenarios, producing expressive facial performance, natural speech-expression coordination, realistic body motion, and precise audio-video synchronization. It supports multilingual spoken generation across Chinese (Mandarin and Cantonese), English, Japanese, Korean, German, and French. For efficient inference, we combine the single-stream backbone with model distillation, latent-space super-resolution, and a Turbo VAE decoder, enabling generation of a 5-second 256p video in 2 seconds on a single H100 GPU. In automatic evaluation, daVinci-MagiHuman achieves the highest visual quality and text alignment among leading open models, along with the lowest word error rate (14.60%) for speech intelligibility. In pairwise human evaluation, it achieves win rates of 80.0% against Ovi 1.1 and 60.9% against LTX 2.3 over 2000 comparisons. We open-source the complete model stack, including the base model, the distilled model, the super-resolution model, and the inference codebase.
title Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2603.21986