Staff View: HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation :: Library Catalog

Bibliographic Details
Main Authors:	Yang, Ling; Zhang, Xinchen; Tian, Ye; Shang, Chenming; Xu, Minghao; Zhang, Wentao; Cui, Bin
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2502.12148

_version_	1866914055044726784
author	Yang, Ling; Zhang, Xinchen; Tian, Ye; Shang, Chenming; Xu, Minghao; Zhang, Wentao; Cui, Bin
author_facet	Yang, Ling; Zhang, Xinchen; Tian, Ye; Shang, Chenming; Xu, Minghao; Zhang, Wentao; Cui, Bin
contents	The remarkable success of the autoregressive paradigm has made significant advancement in Multimodal Large Language Models (MLLMs), with powerful models like Show-o, Transfusion and Emu3 achieving notable progress in unified image understanding and generation. For the first time, we uncover a common phenomenon: the understanding capabilities of MLLMs are typically stronger than their generative capabilities, with a significant gap between the two. Building on this insight, we propose HermesFlow, a simple yet general framework designed to seamlessly bridge the gap between understanding and generation in MLLMs. Specifically, we take the homologous data as input to curate homologous preference data of both understanding and generation. Through Pair-DPO and self-play iterative optimization, HermesFlow effectively aligns multimodal understanding and generation using homologous preference data. Extensive experiments demonstrate the significant superiority of our approach over prior methods, particularly in narrowing the gap between multimodal understanding and generation. These findings highlight the potential of HermesFlow as a general alignment framework for next-generation multimodal foundation models. Code: https://github.com/Gen-Verse/HermesFlow
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_12148
institution	arXiv
publishDate	2025
record_format	arxiv
title	HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2502.12148
