Staff View: Collaborative Text-to-Image Generation via Multi-Agent Reinforcement Learning and Semantic Fusion :: Library Catalog

Bibliographic Details
Main Authors:	Shi, Jiabao; Qi, Minfeng; Zhang, Lefeng; Wang, Di; Zhao, Yingjie; Li, Ziying; Xing, Yalong; Li, Ningran
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2510.10633

_version_	1866915548889088000
author	Shi, Jiabao; Qi, Minfeng; Zhang, Lefeng; Wang, Di; Zhao, Yingjie; Li, Ziying; Xing, Yalong; Li, Ningran
author_facet	Shi, Jiabao; Qi, Minfeng; Zhang, Lefeng; Wang, Di; Zhao, Yingjie; Li, Ziying; Xing, Yalong; Li, Ningran
contents	Multimodal text-to-image generation remains constrained by the difficulty of maintaining semantic alignment and professional-level detail across diverse visual domains. We propose a multi-agent reinforcement learning framework that coordinates domain-specialized agents (e.g., focused on architecture, portraiture, and landscape imagery) within two coupled subsystems: a text enhancement module and an image generation module, each augmented with multimodal integration components. Agents are trained using Proximal Policy Optimization (PPO) under a composite reward function that balances semantic similarity, linguistic visual quality, and content diversity. Cross-modal alignment is enforced through contrastive learning, bidirectional attention, and iterative feedback between text and image. Across six experimental settings, our system significantly enriches generated content (word count increased by 1614%) while reducing ROUGE-1 scores by 69.7%. Among fusion methods, Transformer-based strategies achieve the highest composite score (0.521), despite occasional stability issues. Multimodal ensembles yield moderate consistency (ranging from 0.444 to 0.481), reflecting the persistent challenges of cross-modal semantic grounding. These findings underscore the promise of collaborative, specialization-driven architectures for advancing reliable multimodal generative systems.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_10633
institution	arXiv
publishDate	2025
record_format	arxiv
title	Collaborative Text-to-Image Generation via Multi-Agent Reinforcement Learning and Semantic Fusion
topic	Artificial Intelligence
url	https://arxiv.org/abs/2510.10633

Similar Items