| Main Authors: | Ma, Guoqing; Huang, Haoyang; Yan, Kun; Chen, Liangyu; Duan, Nan; et al. |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Computer Vision and Pattern Recognition; Computation and Language |
| Online Access: | https://arxiv.org/abs/2502.10248 |
| _version_ | 1866917935004516352 |
|---|---|
| author | Ma, Guoqing; Huang, Haoyang; Yan, Kun; Chen, Liangyu; Duan, Nan; Yin, Shengming; Wan, Changyi; Ming, Ranchen; Song, Xiaoniu; Chen, Xing; Zhou, Yu; Sun, Deshan; Zhou, Deyu; Zhou, Jian; Tan, Kaijun; An, Kang; Chen, Mei; Ji, Wei; Wu, Qiling; Sun, Wen; Han, Xin; Wei, Yanan; Ge, Zheng; Li, Aojie; Wang, Bin; Huang, Bizhu; Wang, Bo; Li, Brian; Miao, Changxing; Xu, Chen; Wu, Chenfei; Yu, Chenguang; Shi, Dapeng; Hu, Dingyuan; Liu, Enle; Yu, Gang; Yang, Ge; Huang, Guanzhe; Yan, Gulin; Feng, Haiyang; Nie, Hao; Jia, Haonan; Hu, Hanpeng; Chen, Hanqi; Yan, Haolong; Wang, Heng; Guo, Hongcheng; Xiong, Huilin; Xiong, Huixin; Gong, Jiahao; Wu, Jianchang; Wu, Jiaoren; Wu, Jie; Yang, Jie; Liu, Jiashuai; Li, Jiashuo; Zhang, Jingyang; Guo, Junjing; Lin, Junzhe; Li, Kaixiang; Liu, Lei; Xia, Lei; Zhao, Liang; Tan, Liguo; Huang, Liwen; Shi, Liying; Li, Ming; Li, Mingliang; Cheng, Muhua; Wang, Na; Chen, Qiaohui; He, Qinglin; Liang, Qiuyan; Sun, Quan; Sun, Ran; Wang, Rui; Pang, Shaoliang; Yang, Shiliang; Liu, Sitong; Liu, Siqi; Gao, Shuli; Cao, Tiancheng; Wang, Tianyu; Ming, Weipeng; He, Wenqing; Zhao, Xu; Zhang, Xuelin; Zeng, Xianfang; Liu, Xiaojia; Yang, Xuan; Dai, Yaqi; Yu, Yanbo; Li, Yang; Deng, Yineng; Wang, Yingming; Wang, Yilei; Lu, Yuanwei; Chen, Yu; Luo, Yu; Luo, Yuchu; Yin, Yuhe; Feng, Yuheng; Yang, Yuxiang; Tang, Zecheng; Zhang, Zekai; Yang, Zidong; Jiao, Binxing; Chen, Jiansheng; Li, Jing; Zhou, Shuchang; Zhang, Xiangyu; Zhang, Xinhao; Zhu, Yibo; Shum, Heung-Yeung; Jiang, Daxin |
| contents | We present Step-Video-T2V, a state-of-the-art text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length. A deep compression Variational Autoencoder, Video-VAE, is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios, while maintaining exceptional video reconstruction quality. User prompts are encoded using two bilingual text encoders to handle both English and Chinese. A DiT with 3D full attention is trained using Flow Matching and is employed to denoise input noise into latent frames. A video-based DPO approach, Video-DPO, is applied to reduce artifacts and improve the visual quality of the generated videos. We also detail our training strategies and share key observations and insights. Step-Video-T2V's performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, demonstrating its state-of-the-art text-to-video quality when compared with both open-source and commercial engines. Additionally, we discuss the limitations of the current diffusion-based model paradigm and outline future directions for video foundation models. We make both Step-Video-T2V and Step-Video-T2V-Eval available at https://github.com/stepfun-ai/Step-Video-T2V. The online version can be accessed from https://yuewen.cn/videos as well. Our goal is to accelerate the innovation of video foundation models and empower video content creators. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2502_10248 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model |
| topic | Computer Vision and Pattern Recognition; Computation and Language |
| url | https://arxiv.org/abs/2502.10248 |
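
The abstract above reports 16x16 spatial and 8x temporal compression for Video-VAE. As a rough illustration of what those ratios imply, the sketch below computes the nominal latent grid for a clip and the nominal reduction in spatiotemporal positions. The function names, the example clip size (64 frames at 512x512), and the requirement that dimensions divide evenly are assumptions made for this example; the record does not say how the released model pads or chunks inputs such as a 204-frame video, and latent channel counts are not considered.

```python
from typing import Tuple

# Compression ratios as stated in the abstract.
SPATIAL_RATIO = 16   # applied to both height and width
TEMPORAL_RATIO = 8   # applied to the frame axis


def latent_grid(frames: int, height: int, width: int,
                spatial: int = SPATIAL_RATIO,
                temporal: int = TEMPORAL_RATIO) -> Tuple[int, int, int]:
    """Return the nominal (frames, height, width) of the latent grid.

    Hypothetical helper: it only accepts sizes that divide evenly,
    because the padding/chunking rule of the real model is not given here.
    """
    if frames % temporal or height % spatial or width % spatial:
        raise ValueError("dimensions must be multiples of the compression ratios")
    return frames // temporal, height // spatial, width // spatial


def position_reduction(spatial: int = SPATIAL_RATIO,
                       temporal: int = TEMPORAL_RATIO) -> int:
    """Nominal factor by which the number of spatiotemporal positions shrinks."""
    return spatial * spatial * temporal


if __name__ == "__main__":
    # Illustrative clip size, not taken from the paper.
    print(latent_grid(64, 512, 512))   # (8, 32, 32)
    print(position_reduction())        # 2048
```

Under these assumptions, each latent position stands in for a 16 x 16 x 8 block of input pixels, which is where the factor of 2048 comes from.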