Bibliographic Details
Main Authors: Bhalerao, Parth; Yalamarty, Mounika; Trinh, Brian; Ignat, Oana
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access: https://arxiv.org/abs/2502.15972
_version_ 1866917416199520256
author Bhalerao, Parth
Yalamarty, Mounika
Trinh, Brian
Ignat, Oana
author_facet Bhalerao, Parth
Yalamarty, Mounika
Trinh, Brian
Ignat, Oana
contents Text-to-image generation models have achieved strong performance in culturally homogeneous settings, yet their ability to generate multicultural scenes, where people and landmarks originate from different cultures, remains largely unexplored. We introduce multicultural text-to-image generation as a new task and present the first benchmark designed to study this setting. Our dataset contains 9,000 images spanning five countries, three age groups, two genders, 25 historical landmarks, and five languages. Using this benchmark, we analyze the behavior of state-of-the-art text-to-image models across multiple dimensions, including alignment, image quality, aesthetics, knowledge, and fairness. As one strategy for composing cultural and demographic information, we explore MosAIG, a Multi-Agent framework that enhances multicultural Image Generation by leveraging LLMs with distinct cultural personas. Our analysis shows that richer prompt composition can improve image quality and cultural grounding compared to simple prompts, while revealing substantial disparities across languages and demographic groups. We release our dataset and code at https://github.com/AIM-SCU/MosAIG.
format Preprint
id arxiv_https___arxiv_org_abs_2502_15972
institution arXiv
publishDate 2025
record_format arxiv
title When Cultures Meet: Multicultural Text-to-Image Generation
topic Computer Vision and Pattern Recognition
Artificial Intelligence
url https://arxiv.org/abs/2502.15972