Staff View: Learning Explainable Dense Reward Shapes via Bayesian Optimization :: Library Catalog

Bibliographic Details
Main Authors:	Koo, Ryan; Yang, Ian; Raheja, Vipul; Hong, Mingyi; Jun, Kwang-Sung; Kang, Dongyeop
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2504.16272

_version_	1866912341682028544
author	Koo, Ryan; Yang, Ian; Raheja, Vipul; Hong, Mingyi; Jun, Kwang-Sung; Kang, Dongyeop
author_facet	Koo, Ryan; Yang, Ian; Raheja, Vipul; Hong, Mingyi; Jun, Kwang-Sung; Kang, Dongyeop
contents	Current reinforcement learning from human feedback (RLHF) pipelines for large language model (LLM) alignment typically assign scalar rewards to sequences, using the final token as a surrogate indicator for the quality of the entire sequence. However, this leads to sparse feedback and suboptimal token-level credit assignment. In this work, we frame reward shaping as an optimization problem focused on token-level credit assignment. We propose a reward-shaping function leveraging explainability methods such as SHAP and LIME to estimate per-token rewards from the reward model. To learn parameters of this shaping function, we employ a bilevel optimization framework that integrates Bayesian Optimization and policy training to handle noise from the token reward estimates. Our experiments show that achieving a better balance of token-level reward attribution leads to performance improvements over baselines on downstream tasks and finds an optimal policy faster during training. Furthermore, we show theoretically that explainability methods that are feature additive attribution functions maintain the optimal policy as the original reward.
format	Preprint
id	arxiv_https___arxiv_org_abs_2504_16272
institution	arXiv
publishDate	2025
record_format	arxiv
title	Learning Explainable Dense Reward Shapes via Bayesian Optimization
topic	Machine Learning
url	https://arxiv.org/abs/2504.16272
