Staff View: Boosting Universal LLM Reward Design through Heuristic Reward Observation Space Evolution :: Library Catalog

Bibliographic Details
Main Authors:	Heng, Zen Kit; Zhao, Zimeng; Wu, Tianhao; Wang, Yuanfei; Wu, Mingdong; Wang, Yangang; Dong, Hao
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2504.07596

_version_	1866913788397092864
author	Heng, Zen Kit; Zhao, Zimeng; Wu, Tianhao; Wang, Yuanfei; Wu, Mingdong; Wang, Yangang; Dong, Hao
author_facet	Heng, Zen Kit; Zhao, Zimeng; Wu, Tianhao; Wang, Yuanfei; Wu, Mingdong; Wang, Yangang; Dong, Hao
contents	Large Language Models (LLMs) are emerging as promising tools for automated reinforcement learning (RL) reward design, owing to their robust capabilities in commonsense reasoning and code generation. By engaging in dialogues with RL agents, LLMs construct a Reward Observation Space (ROS) by selecting relevant environment states and defining their internal operations. However, existing frameworks have not effectively leveraged historical exploration data or manual task descriptions to iteratively evolve this space. In this paper, we propose a novel heuristic framework that enhances LLM-driven reward design by evolving the ROS through a table-based exploration caching mechanism and a text-code reconciliation strategy. Our framework introduces a state execution table, which tracks the historical usage and success rates of environment states, overcoming the Markovian constraint typically found in LLM dialogues and facilitating more effective exploration. Furthermore, we reconcile user-provided task descriptions with expert-defined success criteria using structured prompts, ensuring alignment in reward design objectives. Comprehensive evaluations on benchmark RL tasks demonstrate the effectiveness and stability of the proposed framework. Code and video demos are available at jingjjjjjie.github.io/LLM2Reward.
format	Preprint
id	arxiv_https___arxiv_org_abs_2504_07596
institution	arXiv
publishDate	2025
record_format	arxiv
title	Boosting Universal LLM Reward Design through Heuristic Reward Observation Space Evolution
topic	Artificial Intelligence
url	https://arxiv.org/abs/2504.07596
