Bibliographic Details
Main Authors: Cao, Pu, Zhou, Feng, Ji, Junyi, Kong, Qingye, Lv, Zhixiang, Zhang, Mingjian, Zhao, Xuekun, Wu, Siqi, Lin, Yinghui, Song, Qing, Yang, Lu
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition; Artificial Intelligence; Image and Video Processing
Online Access: https://arxiv.org/abs/2505.05501
_version_ 1866916728090394624
author Cao, Pu
Zhou, Feng
Ji, Junyi
Kong, Qingye
Lv, Zhixiang
Zhang, Mingjian
Zhao, Xuekun
Wu, Siqi
Lin, Yinghui
Song, Qing
Yang, Lu
contents OpenAI recently unlocked the visual generation ability of GPT-4o(mni). The model demonstrates remarkable generation capability, with excellent understanding of multimodal conditions and varied task instructions. In this paper, we explore the capabilities of GPT-4o across various tasks. Inspired by previous studies, we constructed a task taxonomy along with a carefully curated set of test samples to conduct a comprehensive qualitative test. Benefiting from GPT-4o's powerful multimodal comprehension, its image-generation process demonstrates abilities that go beyond traditional image-generation tasks. We therefore evaluate its performance across six task categories, organized by dimension of model capability: traditional image generation tasks, discriminative tasks, knowledge-based generation, commonsense-based generation, spatially-aware image generation, and temporally-aware image generation. These tasks not only assess the quality and conditional alignment of the model's outputs but also probe GPT-4o's understanding of real-world concepts. Our results reveal that GPT-4o performs impressively well in general-purpose synthesis tasks, showing strong capabilities in text-to-image generation, visual stylization, and low-level image processing. However, significant limitations remain in its ability to perform precise spatial reasoning, instruction-grounded generation, and consistent temporal prediction. Furthermore, when faced with knowledge-intensive or domain-specific scenarios, such as scientific illustrations or mathematical plots, the model often exhibits hallucinations, factual errors, or structural inconsistencies. These findings suggest that while GPT-4o marks a substantial advancement in unified multimodal generation, it still has a long way to go before it can be reliably applied to professional or safety-critical domains.
format Preprint
id arxiv_https___arxiv_org_abs_2505_05501
institution arXiv
publishDate 2025
record_format arxiv
title Preliminary Explorations with GPT-4o(mni) Native Image Generation
topic Computer Vision and Pattern Recognition
Artificial Intelligence
Image and Video Processing
url https://arxiv.org/abs/2505.05501