Staff View: Preliminary Explorations with GPT-4o(mni) Native Image Generation :: Library Catalog


Bibliographic Details
Main Authors:	Cao, Pu; Zhou, Feng; Ji, Junyi; Kong, Qingye; Lv, Zhixiang; Zhang, Mingjian; Zhao, Xuekun; Wu, Siqi; Lin, Yinghui; Song, Qing; Yang, Lu
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence; Image and Video Processing
Online Access:	https://arxiv.org/abs/2505.05501

_version_	1866916728090394624
author	Cao, Pu; Zhou, Feng; Ji, Junyi; Kong, Qingye; Lv, Zhixiang; Zhang, Mingjian; Zhao, Xuekun; Wu, Siqi; Lin, Yinghui; Song, Qing; Yang, Lu
author_facet	Cao, Pu; Zhou, Feng; Ji, Junyi; Kong, Qingye; Lv, Zhixiang; Zhang, Mingjian; Zhao, Xuekun; Wu, Siqi; Lin, Yinghui; Song, Qing; Yang, Lu
contents	OpenAI recently unlocked the visual generation ability of GPT-4o(mni). It demonstrates remarkable generation capability, with excellent understanding of multimodal conditions and varied task instructions. In this paper, we aim to explore the capabilities of GPT-4o across various tasks. Inspired by previous studies, we constructed a task taxonomy along with a carefully curated set of test samples to conduct a comprehensive qualitative test. Benefiting from GPT-4o's powerful multimodal comprehension, its image-generation process demonstrates abilities that go beyond traditional image-generation tasks. We therefore evaluate its performance across six task categories: traditional image generation tasks, discriminative tasks, knowledge-based generation, commonsense-based generation, spatially-aware image generation, and temporally-aware image generation. These tasks not only assess the quality and conditional alignment of the model's outputs but also probe deeper into GPT-4o's understanding of real-world concepts. Our results reveal that GPT-4o performs impressively well in general-purpose synthesis tasks, showing strong capabilities in text-to-image generation, visual stylization, and low-level image processing. However, significant limitations remain in its ability to perform precise spatial reasoning, instruction-grounded generation, and consistent temporal prediction. Furthermore, when faced with knowledge-intensive or domain-specific scenarios, such as scientific illustrations or mathematical plots, the model often exhibits hallucinations, factual errors, or structural inconsistencies. These findings suggest that while GPT-4o marks a substantial advancement in unified multimodal generation, there is still a long way to go before it can be reliably applied to professional or safety-critical domains.
format	Preprint
id	arxiv_https___arxiv_org_abs_2505_05501
institution	arXiv
publishDate	2025
record_format	arxiv
title	Preliminary Explorations with GPT-4o(mni) Native Image Generation
topic	Computer Vision and Pattern Recognition; Artificial Intelligence; Image and Video Processing
url	https://arxiv.org/abs/2505.05501
