Staff View: ERNIE-Image Technical Report

Bibliographic Details
Main Authors:	Liu, Jiaxiang; Feng, Zhida; Zou, Pengyu; Qian, Zhenyu; Zhu, Tianrui; Xia, Jun; Dong, Yuehu; Lin, Yanzheng; Xiong, Honglin; Chen, Anqi; Ding, Yunpeng; Duan, Jinghui; Gao, Lin; Han, Chao; He, Tiechao; Hu, Jiakang; Hua, Ranjun; Jiang, Xueming; Kong, Qingli; Lei, Yuting; Li, Tianyu; Liu, Yunlin; Liu, Changling; Liu, Yaxin; Liu, Yi; Liu, Xuguang; Ma, Xiaolong; Pan, Yan; Ren, Yiran; Sheng, Nan; Sun, Yu; Sun, Siyang; Tu, Yixiang; Wan, Yang; Wang, Huanai; Wang, Siqi; Wu, Yang; Yang, Youzhi; Yang, Xiaowen; Yang, Jianwen; Yang, Yehua; Zhang, Quanwen; Zhang, Xinmin; Zhang, Haoxin; Zhang, Xiang; Zhang, Jun; Zhang, Qian; Zhao, Qiao; Zhou, Qi
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Machine Learning
Online Access:	https://arxiv.org/abs/2605.25347

_version_	1866916044504825856
author	Liu, Jiaxiang; Feng, Zhida; Zou, Pengyu; Qian, Zhenyu; Zhu, Tianrui; Xia, Jun; Dong, Yuehu; Lin, Yanzheng; Xiong, Honglin; Chen, Anqi; Ding, Yunpeng; Duan, Jinghui; Gao, Lin; Han, Chao; He, Tiechao; Hu, Jiakang; Hua, Ranjun; Jiang, Xueming; Kong, Qingli; Lei, Yuting; Li, Tianyu; Liu, Yunlin; Liu, Changling; Liu, Yaxin; Liu, Yi; Liu, Xuguang; Ma, Xiaolong; Pan, Yan; Ren, Yiran; Sheng, Nan; Sun, Yu; Sun, Siyang; Tu, Yixiang; Wan, Yang; Wang, Huanai; Wang, Siqi; Wu, Yang; Yang, Youzhi; Yang, Xiaowen; Yang, Jianwen; Yang, Yehua; Zhang, Quanwen; Zhang, Xinmin; Zhang, Haoxin; Zhang, Xiang; Zhang, Jun; Zhang, Qian; Zhao, Qiao; Zhou, Qi
contents	We introduce ERNIE-Image, an open-source text-to-image generation model built upon an 8B single-stream DiT architecture. ERNIE-Image aims to bridge the gap between current open-source models and leading closed-source systems through more effective mining of large-scale pre-training data and improved supervision quality throughout training. During pre-training, we adopt a bottom-up data construction pipeline that combines fine-grained image categorization, rich caption annotation, aesthetic assessment, and hierarchical sampling. This strategy reduces data noise while preserving long-tail concepts and detailed real-world knowledge, providing a stronger foundation for complex generation tasks. In the post-training stage, we use a top-down data construction pipeline for high-demand scenarios, diversify prompt annotations to better match real user inputs, and apply a stabilized DPO strategy to align the model with human aesthetic preferences. We further train ERNIE-Image-Turbo for efficient 8-NFE generation and propose MT-DMD to mitigate capability drift during distillation. To make the model easier to use in practical scenarios, we equip it with a lightweight Prompt Enhancer that expands concise user intents into structured visual descriptions. In addition, we develop ERNIE-Image-Aes, an industrial-grade aesthetic model, together with ERNIE-Image-Aes-1K, a human-annotated benchmark for realistic aesthetic evaluation. Extensive qualitative and quantitative experiments show that ERNIE-Image achieves leading performance among open-source models and approaches top-tier commercial models in instruction following, text rendering, and aesthetic quality. We release the trained models and aesthetic resources to facilitate further academic research and technical progress in the AIGC community.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_25347
institution	arXiv
publishDate	2026
record_format	arxiv
title	ERNIE-Image Technical Report
topic	Computer Vision and Pattern Recognition; Machine Learning
url	https://arxiv.org/abs/2605.25347

Similar Items