Staff View :: Library Catalog

Bibliographic Details
Main Authors:	Su, Zhaolong; Lu, Wang; Chen, Hao; Li, Sharon; Wang, Jindong
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Artificial Intelligence; Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2511.19413

_version_	1866912987166539776
author	Su, Zhaolong; Lu, Wang; Chen, Hao; Li, Sharon; Wang, Jindong
author_facet	Su, Zhaolong; Lu, Wang; Chen, Hao; Li, Sharon; Wang, Jindong
contents	Unified Multimodal Models (UMMs) have shown impressive performance in both understanding and generation with a single architecture. However, UMMs still exhibit a fundamental inconsistency: understanding favors compact embeddings, whereas generation favors reconstruction-rich representations. This structural trade-off produces misaligned decision boundaries, degraded cross-modal coherence, and heightened vulnerability under distributional and adversarial shifts. In this paper, we present UniGame, a self-adversarial post-training framework that directly targets these inconsistencies. By applying a lightweight perturber at the shared token interface, UniGame enables the generation branch to actively seek out and challenge fragile understanding, turning the model itself into its own adversary. Experiments demonstrate that UniGame significantly improves consistency (+4.6%). It also achieves substantial improvements in understanding (+3.6%), generation (+0.02 on GenEval), and out-of-distribution and adversarial robustness (+4.8% and +6.2% on NaturalBench and AdVQA, respectively). The framework is architecture-agnostic, introduces less than 1% additional parameters, and is complementary to existing post-training methods. These results position adversarial self-play as a general and effective principle for enhancing the coherence, stability, and unified competence of future multimodal foundation models. The official code is available at: https://github.com/AIFrontierLab/TorchUMM
format	Preprint
id	arxiv_https___arxiv_org_abs_2511_19413
institution	arXiv
publishDate	2025
record_format	arxiv
title	UniGame: Turning a Unified Multimodal Model Into Its Own Adversary
topic	Machine Learning; Artificial Intelligence; Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2511.19413

Similar Items