Bibliographic Details
Main Authors: Chen, Sixiang, Bai, Jinbin, Zhao, Zhuoran, Ye, Tian, Shi, Qingyu, Zhou, Donghao, Chai, Wenhao, Lin, Xin, Wu, Jianzong, Tang, Chao, Xu, Shilin, Zhang, Tao, Yuan, Haobo, Zhou, Yikang, Chow, Wei, Li, Linfeng, Li, Xiangtai, Zhu, Lei, Qi, Lu
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2504.05979
author Chen, Sixiang
Bai, Jinbin
Zhao, Zhuoran
Ye, Tian
Shi, Qingyu
Zhou, Donghao
Chai, Wenhao
Lin, Xin
Wu, Jianzong
Tang, Chao
Xu, Shilin
Zhang, Tao
Yuan, Haobo
Zhou, Yikang
Chow, Wei
Li, Linfeng
Li, Xiangtai
Zhu, Lei
Qi, Lu
contents The landscape of image generation has rapidly evolved, from early GAN-based approaches to diffusion models and, most recently, to unified generative architectures that seek to bridge understanding and generation tasks. Recent advances, especially GPT-4o, have demonstrated the feasibility of high-fidelity multimodal generation, but their architectural design remains unpublished. This raises the question of whether these methods have already integrated image and text generation into a unified framework. In this work, we conduct an empirical study of GPT-4o's image generation capabilities, benchmarking it against leading open-source and commercial models. Our evaluation covers four main categories (text-to-image, image-to-image, image-to-3D, and image-to-X generation) spanning more than 20 tasks. Our analysis highlights the strengths and limitations of GPT-4o under various settings and situates it within the broader evolution of generative modeling. Through this investigation, we identify promising directions for future unified generative models, emphasizing the role of architectural design and data scaling. A high-definition version of the PDF is available on GitHub: https://github.com/Ephemeral182/Empirical-Study-of-GPT-4o-Image-Gen
format Preprint
id arxiv_https___arxiv_org_abs_2504_05979
institution arXiv
publishDate 2025
record_format arxiv
title An Empirical Study of GPT-4o Image Generation Capabilities
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2504.05979