Staff View: When Cultures Meet: Multicultural Text-to-Image Generation :: Library Catalog

Bibliographic Details
Main Authors:	Bhalerao, Parth; Yalamarty, Mounika; Trinh, Brian; Ignat, Oana
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2502.15972

_version_	1866917416199520256
author	Bhalerao, Parth; Yalamarty, Mounika; Trinh, Brian; Ignat, Oana
author_facet	Bhalerao, Parth; Yalamarty, Mounika; Trinh, Brian; Ignat, Oana
contents	Text-to-image generation models have achieved strong performance in culturally homogeneous settings, yet their ability to generate multicultural scenes, where people and landmarks originate from different cultures, remains largely unexplored. We introduce multicultural text-to-image generation as a new task and present the first benchmark designed to study this setting. Our dataset contains 9,000 images spanning five countries, three age groups, two genders, 25 historical landmarks, and five languages. Using this benchmark, we analyze the behavior of state-of-the-art text-to-image models across multiple dimensions, including alignment, image quality, aesthetics, knowledge, and fairness. As one strategy for composing cultural and demographic information, we explore MosAIG, a Multi-Agent framework that enhances multicultural Image Generation by leveraging LLMs with distinct cultural personas. Our analysis shows that richer prompt composition can improve image quality and cultural grounding compared to simple prompts, while revealing substantial disparities across languages and demographic groups. We release our dataset and code at https://github.com/AIM-SCU/MosAIG.
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_15972
institution	arXiv
publishDate	2025
record_format	arxiv
title	When Cultures Meet: Multicultural Text-to-Image Generation
topic	Computer Vision and Pattern Recognition; Artificial Intelligence
url	https://arxiv.org/abs/2502.15972

Similar Items