Bibliographic Details
Main Authors: Zhang, Yuting; Sun, Ying; Shen, Dazhong; Xie, Ziwei; Liu, Feng; Zhang, Changwang; Liu, Xiang; Wang, Jun; Xiong, Hui
Format: Preprint
Published: 2026
Subjects: Information Retrieval
Online Access: https://arxiv.org/abs/2604.20434
_version_ 1866913054737825792
author Zhang, Yuting
Sun, Ying
Shen, Dazhong
Xie, Ziwei
Liu, Feng
Zhang, Changwang
Liu, Xiang
Wang, Jun
Xiong, Hui
contents The emergence of generative models enables the creation of texts and images tailored to users' preferences. Existing personalized generative models have two critical limitations: they lack a dedicated paradigm for accurate preference modeling, and they generate unimodal content even though real-world user interactions are driven by multiple modalities. Therefore, we propose personalized multimodal generation, which captures modal-specific preferences from multimodal interactions via a dedicated preference model and then feeds them into downstream generators to produce personalized multimodal content. However, this task presents two challenges: (1) a gap between the continuous preferences produced by dedicated modeling and the discrete token inputs intrinsic to generator architectures; (2) potential inconsistency between generated images and texts. To tackle these, we present a two-stage framework called Discrete Preference learning for Personalized Multimodal Generation (DPPMG). In the first stage, we introduce a modal-specific graph neural network (a dedicated preference model) to learn users' modal-specific preferences, which are then quantized into discrete preference tokens. In the second stage, the discrete modal-specific preference tokens are injected into downstream text and image generators. To further enhance cross-modal consistency while preserving personalization, we design a cross-modal consistent and personalized reward to fine-tune the token-associated parameters. Extensive experiments on two real-world datasets demonstrate the effectiveness of our model in generating personalized and consistent multimodal content.
format Preprint
id arxiv_https___arxiv_org_abs_2604_20434
institution arXiv
publishDate 2026
record_format arxiv
title Discrete Preference Learning for Personalized Multimodal Generation
topic Information Retrieval
url https://arxiv.org/abs/2604.20434