Bibliographic Details
Main Authors: Ma, Guoqing, Huang, Haoyang, Yan, Kun, Chen, Liangyu, Duan, Nan, Yin, Shengming, Wan, Changyi, Ming, Ranchen, Song, Xiaoniu, Chen, Xing, Zhou, Yu, Sun, Deshan, Zhou, Deyu, Zhou, Jian, Tan, Kaijun, An, Kang, Chen, Mei, Ji, Wei, Wu, Qiling, Sun, Wen, Han, Xin, Wei, Yanan, Ge, Zheng, Li, Aojie, Wang, Bin, Huang, Bizhu, Wang, Bo, Li, Brian, Miao, Changxing, Xu, Chen, Wu, Chenfei, Yu, Chenguang, Shi, Dapeng, Hu, Dingyuan, Liu, Enle, Yu, Gang, Yang, Ge, Huang, Guanzhe, Yan, Gulin, Feng, Haiyang, Nie, Hao, Jia, Haonan, Hu, Hanpeng, Chen, Hanqi, Yan, Haolong, Wang, Heng, Guo, Hongcheng, Xiong, Huilin, Xiong, Huixin, Gong, Jiahao, Wu, Jianchang, Wu, Jiaoren, Wu, Jie, Yang, Jie, Liu, Jiashuai, Li, Jiashuo, Zhang, Jingyang, Guo, Junjing, Lin, Junzhe, Li, Kaixiang, Liu, Lei, Xia, Lei, Zhao, Liang, Tan, Liguo, Huang, Liwen, Shi, Liying, Li, Ming, Li, Mingliang, Cheng, Muhua, Wang, Na, Chen, Qiaohui, He, Qinglin, Liang, Qiuyan, Sun, Quan, Sun, Ran, Wang, Rui, Pang, Shaoliang, Yang, Shiliang, Liu, Sitong, Liu, Siqi, Gao, Shuli, Cao, Tiancheng, Wang, Tianyu, Ming, Weipeng, He, Wenqing, Zhao, Xu, Zhang, Xuelin, Zeng, Xianfang, Liu, Xiaojia, Yang, Xuan, Dai, Yaqi, Yu, Yanbo, Li, Yang, Deng, Yineng, Wang, Yingming, Wang, Yilei, Lu, Yuanwei, Chen, Yu, Luo, Yu, Luo, Yuchu, Yin, Yuhe, Feng, Yuheng, Yang, Yuxiang, Tang, Zecheng, Zhang, Zekai, Yang, Zidong, Jiao, Binxing, Chen, Jiansheng, Li, Jing, Zhou, Shuchang, Zhang, Xiangyu, Zhang, Xinhao, Zhu, Yibo, Shum, Heung-Yeung, Jiang, Daxin
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition; Computation and Language
Online Access: https://arxiv.org/abs/2502.10248
author Ma, Guoqing
Huang, Haoyang
Yan, Kun
Chen, Liangyu
Duan, Nan
Yin, Shengming
Wan, Changyi
Ming, Ranchen
Song, Xiaoniu
Chen, Xing
Zhou, Yu
Sun, Deshan
Zhou, Deyu
Zhou, Jian
Tan, Kaijun
An, Kang
Chen, Mei
Ji, Wei
Wu, Qiling
Sun, Wen
Han, Xin
Wei, Yanan
Ge, Zheng
Li, Aojie
Wang, Bin
Huang, Bizhu
Wang, Bo
Li, Brian
Miao, Changxing
Xu, Chen
Wu, Chenfei
Yu, Chenguang
Shi, Dapeng
Hu, Dingyuan
Liu, Enle
Yu, Gang
Yang, Ge
Huang, Guanzhe
Yan, Gulin
Feng, Haiyang
Nie, Hao
Jia, Haonan
Hu, Hanpeng
Chen, Hanqi
Yan, Haolong
Wang, Heng
Guo, Hongcheng
Xiong, Huilin
Xiong, Huixin
Gong, Jiahao
Wu, Jianchang
Wu, Jiaoren
Wu, Jie
Yang, Jie
Liu, Jiashuai
Li, Jiashuo
Zhang, Jingyang
Guo, Junjing
Lin, Junzhe
Li, Kaixiang
Liu, Lei
Xia, Lei
Zhao, Liang
Tan, Liguo
Huang, Liwen
Shi, Liying
Li, Ming
Li, Mingliang
Cheng, Muhua
Wang, Na
Chen, Qiaohui
He, Qinglin
Liang, Qiuyan
Sun, Quan
Sun, Ran
Wang, Rui
Pang, Shaoliang
Yang, Shiliang
Liu, Sitong
Liu, Siqi
Gao, Shuli
Cao, Tiancheng
Wang, Tianyu
Ming, Weipeng
He, Wenqing
Zhao, Xu
Zhang, Xuelin
Zeng, Xianfang
Liu, Xiaojia
Yang, Xuan
Dai, Yaqi
Yu, Yanbo
Li, Yang
Deng, Yineng
Wang, Yingming
Wang, Yilei
Lu, Yuanwei
Chen, Yu
Luo, Yu
Luo, Yuchu
Yin, Yuhe
Feng, Yuheng
Yang, Yuxiang
Tang, Zecheng
Zhang, Zekai
Yang, Zidong
Jiao, Binxing
Chen, Jiansheng
Li, Jing
Zhou, Shuchang
Zhang, Xiangyu
Zhang, Xinhao
Zhu, Yibo
Shum, Heung-Yeung
Jiang, Daxin
contents We present Step-Video-T2V, a state-of-the-art text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length. A deep-compression Variational Autoencoder, Video-VAE, is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios while maintaining exceptional video reconstruction quality. User prompts are encoded using two bilingual text encoders to handle both English and Chinese. A DiT with 3D full attention is trained using Flow Matching and is employed to denoise input noise into latent frames. A video-based DPO approach, Video-DPO, is applied to reduce artifacts and improve the visual quality of the generated videos. We also detail our training strategies and share key observations and insights. Step-Video-T2V's performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, demonstrating its state-of-the-art text-to-video quality when compared with both open-source and commercial engines. Additionally, we discuss the limitations of the current diffusion-based model paradigm and outline future directions for video foundation models. We make both Step-Video-T2V and Step-Video-T2V-Eval available at https://github.com/stepfun-ai/Step-Video-T2V. The online version can also be accessed at https://yuewen.cn/videos. Our goal is to accelerate the innovation of video foundation models and empower video content creators.
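The compression ratios quoted in the abstract (16x16 spatial, 8x temporal) can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the Step-Video-T2V code. The function names, the 544x992 frame size, and the round-up treatment of lengths that are not multiples of the ratio are all assumptions, since the abstract does not say how such cases are handled.

```python
def ceil_div(n: int, d: int) -> int:
    """Integer division that rounds up (assumed padding behavior)."""
    return -(-n // d)


def latent_grid(frames: int, height: int, width: int,
                t_ratio: int = 8, s_ratio: int = 16) -> tuple[int, int, int]:
    """Return the (time, height, width) size of the latent grid.

    t_ratio and s_ratio are the temporal and spatial compression ratios
    quoted in the abstract; everything else here is hypothetical.
    """
    return (ceil_div(frames, t_ratio),
            ceil_div(height, s_ratio),
            ceil_div(width, s_ratio))


def positions_per_latent(t_ratio: int = 8, s_ratio: int = 16) -> int:
    """Number of input pixel positions covered by one latent position."""
    return t_ratio * s_ratio * s_ratio


if __name__ == "__main__":
    # 204 frames is from the abstract; 544x992 is an assumed resolution.
    print(latent_grid(204, 544, 992))
    print(positions_per_latent())
```

Under these assumptions, each latent position stands in for 8 x 16 x 16 pixel positions, which is why a 3D full-attention DiT over the latent grid is far cheaper than one over raw frames. The latent channel count is not given in the abstract, so it is left out of the calculation.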
format Preprint
id arxiv_https___arxiv_org_abs_2502_10248
institution arXiv
publishDate 2025
record_format arxiv
title Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
topic Computer Vision and Pattern Recognition
Computation and Language
url https://arxiv.org/abs/2502.10248