Staff View: Towards Faithful and Controllable Personalization via Critique-Post-Edit Reinforcement Learning :: Library Catalog

Bibliographic Details
Main Authors:	Zhu, Chenghao; Tao, Meiling; Wang, Tiannan; Ding, Dongyi; Jiang, Yuchen Eleanor; Zhou, Wangchunshu
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2510.18849

_version_	1866917030777585664
author	Zhu, Chenghao; Tao, Meiling; Wang, Tiannan; Ding, Dongyi; Jiang, Yuchen Eleanor; Zhou, Wangchunshu
author_facet	Zhu, Chenghao; Tao, Meiling; Wang, Tiannan; Ding, Dongyi; Jiang, Yuchen Eleanor; Zhou, Wangchunshu
contents	Faithfully personalizing large language models (LLMs) to align with individual user preferences is a critical but challenging task. While supervised fine-tuning (SFT) quickly reaches a performance plateau, standard reinforcement learning from human feedback (RLHF) also struggles with the nuances of personalization. Scalar-based reward models are prone to reward hacking, which leads to verbose and superficially personalized responses. To address these limitations, we propose Critique-Post-Edit, a robust reinforcement learning framework that enables more faithful and controllable personalization. Our framework integrates two key components: (1) a Personalized Generative Reward Model (GRM) that provides multi-dimensional scores and textual critiques to resist reward hacking, and (2) a Critique-Post-Edit mechanism where the policy model revises its own outputs based on these critiques for more targeted and efficient learning. Under a rigorous length-controlled evaluation, our method substantially outperforms standard PPO on personalization benchmarks. The personalized Qwen2.5-7B model achieves an average 11% win-rate improvement, and the personalized Qwen2.5-14B model surpasses the performance of GPT-4.1. These results demonstrate a practical path to faithful, efficient, and controllable personalization.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_18849
institution	arXiv
publishDate	2025
record_format	arxiv
title	Towards Faithful and Controllable Personalization via Critique-Post-Edit Reinforcement Learning
topic	Computation and Language; Artificial Intelligence
url	https://arxiv.org/abs/2510.18849

Similar Items