author Liu, Jiaxiang
Feng, Zhida
Zou, Pengyu
Qian, Zhenyu
Zhu, Tianrui
Xia, Jun
Dong, Yuehu
Lin, Yanzheng
Xiong, Honglin
Chen, Anqi
Ding, Yunpeng
Duan, Jinghui
Gao, Lin
Han, Chao
He, Tiechao
Hu, Jiakang
Hua, Ranjun
Jiang, Xueming
Kong, Qingli
Lei, Yuting
Li, Tianyu
Liu, Yunlin
Liu, Changling
Liu, Yaxin
Liu, Yi
Liu, Xuguang
Ma, Xiaolong
Pan, Yan
Ren, Yiran
Sheng, Nan
Sun, Yu
Sun, Siyang
Tu, Yixiang
Wan, Yang
Wang, Huanai
Wang, Siqi
Wu, Yang
Yang, Youzhi
Yang, Xiaowen
Yang, Jianwen
Yang, Yehua
Zhang, Quanwen
Zhang, Xinmin
Zhang, Haoxin
Zhang, Xiang
Zhang, Jun
Zhang, Qian
Zhao, Qiao
Zhou, Qi
contents We introduce ERNIE-Image, an open-source text-to-image generation model built upon an 8B single-stream DiT architecture. ERNIE-Image aims to bridge the gap between current open-source models and leading closed-source systems through more effective mining of large-scale pre-training data and improved supervision quality throughout training. During pre-training, we adopt a bottom-up data construction pipeline that combines fine-grained image categorization, rich caption annotation, aesthetic assessment, and hierarchical sampling. This strategy reduces data noise while preserving long-tail concepts and detailed real-world knowledge, providing a stronger foundation for complex generation tasks. In the post-training stage, we use a top-down data construction pipeline for high-demand scenarios, diversify prompt annotations to better match real user inputs, and apply a stabilized DPO strategy to align the model with human aesthetic preferences. We further train ERNIE-Image-Turbo for efficient 8-NFE generation and propose MT-DMD to mitigate capability drift during distillation. To make the model easier to use in practical scenarios, we equip it with a lightweight Prompt Enhancer that expands concise user intents into structured visual descriptions. In addition, we develop ERNIE-Image-Aes, an industrial-grade aesthetic model, together with ERNIE-Image-Aes-1K, a human-annotated benchmark for realistic aesthetic evaluation. Extensive qualitative and quantitative experiments show that ERNIE-Image achieves leading performance among open-source models and approaches top-tier commercial models in instruction following, text rendering, and aesthetic quality. We release the trained models and aesthetic resources to facilitate further academic research and technical progress in the AIGC community.
format Preprint
id arxiv_https___arxiv_org_abs_2605_25347
institution arXiv
publishDate 2026
record_format arxiv
title ERNIE-Image Technical Report
topic Computer Vision and Pattern Recognition
Machine Learning
url https://arxiv.org/abs/2605.25347