Bibliographic Details
Main Authors: Shi, Jiabao; Qi, Minfeng; Zhang, Lefeng; Wang, Di; Zhao, Yingjie; Li, Ziying; Xing, Yalong; Li, Ningran
Format: Preprint
Published: 2025
Subjects: Artificial Intelligence
Online Access: https://arxiv.org/abs/2510.10633
_version_ 1866915548889088000
author Shi, Jiabao
Qi, Minfeng
Zhang, Lefeng
Wang, Di
Zhao, Yingjie
Li, Ziying
Xing, Yalong
Li, Ningran
author_facet Shi, Jiabao
Qi, Minfeng
Zhang, Lefeng
Wang, Di
Zhao, Yingjie
Li, Ziying
Xing, Yalong
Li, Ningran
contents Multimodal text-to-image generation remains constrained by the difficulty of maintaining semantic alignment and professional-level detail across diverse visual domains. We propose a multi-agent reinforcement learning framework that coordinates domain-specialized agents (e.g., focused on architecture, portraiture, and landscape imagery) within two coupled subsystems: a text enhancement module and an image generation module, each augmented with multimodal integration components. Agents are trained using Proximal Policy Optimization (PPO) under a composite reward function that balances semantic similarity, linguistic visual quality, and content diversity. Cross-modal alignment is enforced through contrastive learning, bidirectional attention, and iterative feedback between text and image. Across six experimental settings, our system significantly enriches generated content (word count increased by 1614%) while reducing ROUGE-1 scores by 69.7%. Among fusion methods, Transformer-based strategies achieve the highest composite score (0.521), despite occasional stability issues. Multimodal ensembles yield moderate consistency (ranging from 0.444 to 0.481), reflecting the persistent challenges of cross-modal semantic grounding. These findings underscore the promise of collaborative, specialization-driven architectures for advancing reliable multimodal generative systems.
format Preprint
id arxiv_https___arxiv_org_abs_2510_10633
institution arXiv
publishDate 2025
record_format arxiv
spellingShingle Collaborative Text-to-Image Generation via Multi-Agent Reinforcement Learning and Semantic Fusion
Shi, Jiabao
Qi, Minfeng
Zhang, Lefeng
Wang, Di
Zhao, Yingjie
Li, Ziying
Xing, Yalong
Li, Ningran
Artificial Intelligence
title Collaborative Text-to-Image Generation via Multi-Agent Reinforcement Learning and Semantic Fusion
topic Artificial Intelligence
url https://arxiv.org/abs/2510.10633
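Note on the composite reward named in the contents field: the abstract says the reward balances semantic similarity, a quality term, and content diversity, but this record gives neither the formula nor the weights. The sketch below is only one plausible reading, a normalized weighted sum. The names RewardWeights and composite_reward, the default weights, and the assumption that each component score lies in [0, 1] are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights; the catalog record does not state the real values.
    semantic: float = 0.5
    quality: float = 0.3
    diversity: float = 0.2


def composite_reward(
    semantic: float,
    quality: float,
    diversity: float,
    weights: RewardWeights = RewardWeights(),
) -> float:
    """Return a normalized weighted sum of three component scores.

    Each score is assumed (not confirmed by the source) to lie in [0, 1].
    """
    scores = {"semantic": semantic, "quality": quality, "diversity": diversity}
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score must be in [0, 1], got {value}")

    total_weight = weights.semantic + weights.quality + weights.diversity
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    weighted = (
        weights.semantic * semantic
        + weights.quality * quality
        + weights.diversity * diversity
    )
    # Dividing by the weight total keeps the reward in [0, 1] even if the
    # weights do not sum to exactly one.
    return weighted / total_weight


if __name__ == "__main__":
    # Illustrative inputs only; these are not results from the paper.
    print(composite_reward(semantic=0.8, quality=0.5, diversity=0.2))
```

In a PPO setup such a scalar would typically be the per-episode reward passed to the policy update. How the paper combines or schedules the three terms should be checked against the full text at the URL above.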