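Staff View: DIVA: Harnessing the Representation Divergence in Unified Multimodal Models for Mutual Reinforcement :: Library Catalog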

Bibliographic Details
Main Authors:	Lu, Renjie; Zhang, Xulong; Qu, Xiaoyang; Wang, Shangfei; Wang, Jianzong
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Multimedia
Online Access:	https://arxiv.org/abs/2605.25328

_version_	1866914598907543552
author	Lu, Renjie; Zhang, Xulong; Qu, Xiaoyang; Wang, Shangfei; Wang, Jianzong
author_facet	Lu, Renjie; Zhang, Xulong; Qu, Xiaoyang; Wang, Shangfei; Wang, Jianzong
contents	Unified Multimodal Models (UMMs) built on a single architecture have shown impressive performance in both understanding and generation. We identify a fundamental challenge that lies in the inductive biases induced by distinct supervision signals: the generation branch prefers high-fidelity, fine-grained representations capable of reconstruction, while the understanding branch favours semantically discriminative embeddings that remain invariant to task-irrelevant factors. Consequently, optimizing these complementary but non-equivalent objectives within a monolithic backbone leads to mutual impairment instead of enhancement. In this paper, we first analyze the root cause of this interference in unified backbones and reveal a complementary structure in their internal representations. Motivated by this observation, we propose DIVA, a self-improved post-training framework that transforms the representation divergence into interior synergy. By explicitly factorizing the visual representation into shared and unique components based on two complementary information flows, DIVA enables both the understanding and generation branches to benefit from mutual transfer while preserving the integrity of unique information against cross-flow interference via mutual information estimation. Despite its generality, our method consistently achieves improvements across visual understanding (+7.82%) and generation (+8.46%). The official code is available at: https://github.com/Jayyy-H/DIVA.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_25328
institution	arXiv
publishDate	2026
record_format	arxiv
title	DIVA: Harnessing the Representation Divergence in Unified Multimodal Models for Mutual Reinforcement
topic	Computer Vision and Pattern Recognition; Multimedia
url	https://arxiv.org/abs/2605.25328
