| Main Authors: | Padula, Alexander G., Soemers, Dennis J. N. J. |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2410.17126 |
Similar Items
On Designing Effective RL Reward at Training Time for LLM Reasoning
by: Gao, Jiaxuan, et al.
Published: (2024)
FlowRL: Matching Reward Distributions for LLM Reasoning
by: Zhu, Xuekai, et al.
Published: (2025)
Formal-LLM: Integrating Formal Language and Natural Language for Controllable LLM-based Agents
by: Li, Zelong, et al.
Published: (2024)
Blockwise Advantage Estimation for Multi-Objective RL with Verifiable Rewards
by: Pavlenko, Kirill, et al.
Published: (2026)
RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models
by: Feng, Xiao, et al.
Published: (2026)
On the Optimal Reasoning Length for RL-Trained Language Models
by: Nohara, Daisuke, et al.
Published: (2026)
$Q\sharp$: Provably Optimal Distributional RL for LLM Post-Training
by: Zhou, Jin Peng, et al.
Published: (2025)
ToolRL: Reward is All Tool Learning Needs
by: Qian, Cheng, et al.
Published: (2025)
ROSE: A Reward-Oriented Data Selection Framework for LLM Task-Specific Instruction Tuning
by: Wu, Yang, et al.
Published: (2024)
Exploring Domain Robust Lightweight Reward Models based on Router Mechanism
by: Namgoong, Hyuk, et al.
Published: (2024)
Emergent Representations of Program Semantics in Language Models Trained on Programs
by: Jin, Charles, et al.
Published: (2023)
ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL
by: Zhou, Yifei, et al.
Published: (2024)
Look Inward to Explore Outward: Learning Temperature Policy from LLM Internal States via Hierarchical RL
by: Zhou, Yixiao, et al.
Published: (2026)
GLIDE-RL: Grounded Language Instruction through DEmonstration in RL
by: Kharyal, Chaitanya, et al.
Published: (2024)
Feedback Loops With Language Models Drive In-Context Reward Hacking
by: Pan, Alexander, et al.
Published: (2024)
Scaling LLM Multi-turn RL with End-to-end Summarization-based Context Management
by: Lu, Miao, et al.
Published: (2025)
Synthetic Data RL: Task Definition Is All You Need
by: Guo, Yiduo, et al.
Published: (2025)
Mitigating Lost in Multi-turn Conversation via Curriculum RL with Verifiable Accuracy and Abstention Rewards
by: Li, Ming, et al.
Published: (2025)
Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task
by: Li, Kenneth, et al.
Published: (2022)
AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning
by: Xi, Zhiheng, et al.
Published: (2025)
MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent
by: Yu, Hongli, et al.
Published: (2025)
Heterogeneity in Formal Linguistic Competence of Language Models: Is Data the Real Bottleneck?
by: Renduchintala, H S V N S Kowndinya, et al.
Published: (2026)
Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training
by: Liu, Mingjie, et al.
Published: (2025)
BeamPERL: Parameter-Efficient RL with Verifiable Rewards Specializes Compact LLMs for Structured Beam Mechanics Reasoning
by: Hage, Tarjei Paule, et al.
Published: (2026)
Learning to Reason as Action Abstractions with Scalable Mid-Training RL
by: Zhang, Shenao, et al.
Published: (2025)
Ludax: A GPU-Accelerated Domain Specific Language for Board Games
by: Todd, Graham, et al.
Published: (2025)
AIOS Compiler: LLM as Interpreter for Natural Language Programming and Flow Programming of AI Agents
by: Xu, Shuyuan, et al.
Published: (2024)
Multilinguality in LLM-Designed Reward Functions for Restless Bandits: Effects on Task Performance and Fairness
by: Parthasarathy, Ambreesh, et al.
Published: (2025)
Words as Beacons: Guiding RL Agents with High-Level Language Prompts
by: Ruiz-Gonzalez, Unai, et al.
Published: (2024)
SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning
by: Limozin, Alexis, et al.
Published: (2026)
Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation
by: Yang, Zhuolin, et al.
Published: (2026)
Automated Rewards via LLM-Generated Progress Functions
by: Sarukkai, Vishnu, et al.
Published: (2024)
Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty
by: Damani, Mehul, et al.
Published: (2025)
ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models
by: Akhauri, Yash, et al.
Published: (2024)
The RL/LLM Taxonomy Tree: Reviewing Synergies Between Reinforcement Learning and Large Language Models
by: Pternea, Moschoula, et al.
Published: (2024)
TSR: Trajectory-Search Rollouts for Multi-Turn RL of LLM Agents
by: Djuhera, Aladin, et al.
Published: (2026)
Bridging the Knowledge Void: Inference-time Acquisition of Unfamiliar Programming Languages for Coding Tasks
by: Shen, Chen, et al.
Published: (2026)
Exploring Curriculum Learning for Vision-Language Tasks: A Study on Small-Scale Multimodal Training
by: Saha, Rohan, et al.
Published: (2024)
Lightweight Safety Guardrails via Synthetic Data and RL-guided Adversarial Training
by: Ilin, Aleksei, et al.
Published: (2025)
UserRL: Training Interactive User-Centric Agent via Reinforcement Learning
by: Qian, Cheng, et al.
Published: (2025)