| Main Authors: | Jiang, Daniel R., Bhandari, Jalaj, Yang, Yukai, Munos, Rémi, Lu, Tyler |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2511.21638 |
Similar Items
Aligned Multi Objective Optimization
by: Efroni, Yonathan, et al.
Published: (2025)
Turn-PPO: Turn-Level Advantage Estimation with PPO for Improved Multi-Turn RL in Agentic LLMs
by: Li, Junbo, et al.
Published: (2025)
Outcome-based Exploration for LLM Reasoning
by: Song, Yuda, et al.
Published: (2025)
On a few pitfalls in KL divergence gradient estimation for RL
by: Tang, Yunhao, et al.
Published: (2025)
Super-Exponential Regret for UCT, AlphaGo and Variants
by: Orseau, Laurent, et al.
Published: (2024)
Efficient RL Training for LLMs with Experience Replay
by: Arnal, Charles, et al.
Published: (2026)
Bandits attack function optimization
by: Preux, Philippe, et al.
Published: (2026)
Stochastic simultaneous optimistic optimization
by: Valko, Michal, et al.
Published: (2026)
RL-finetuning LLMs from on- and off-policy data with a single algorithm
by: Tang, Yunhao, et al.
Published: (2025)
Black-box optimization of noisy functions with unknown smoothness
by: Grill, Jean-Bastien, et al.
Published: (2026)
Blazing the trails before beating the path: Sample-efficient Monte-Carlo planning
by: Grill, Jean-Bastien, et al.
Published: (2026)
Beyond Verifiable Rewards: Scaling Reinforcement Learning for Language Models to Unverifiable Data
by: Tang, Yunhao, et al.
Published: (2025)
Optimizing Language Models for Inference Time Objectives using Reinforcement Learning
by: Tang, Yunhao, et al.
Published: (2025)
Spectral Thompson sampling
by: Kocak, Tomas, et al.
Published: (2026)
VA-learning as a more efficient alternative to Q-learning
by: Tang, Yunhao, et al.
Published: (2023)
Spectral bandits for smooth graph functions
by: Valko, Michal, et al.
Published: (2026)
Efficient learning by implicit exploration in bandit problems with side observations
by: Kocak, Tomas, et al.
Published: (2026)
Spectral bandits for smooth graph functions with applications in recommender systems
by: Kocák, Tomáš, et al.
Published: (2026)
Off-policy Distributional Q($λ$): Distributional RL without Importance Sampling
by: Tang, Yunhao, et al.
Published: (2024)
Enhancing PPO with Trajectory-Aware Hybrid Policies
by: Liu, Qisai, et al.
Published: (2025)
Mitigating Conversational Inertia in Multi-Turn Agents
by: Wan, Yang, et al.
Published: (2026)
Spectral bandits
by: Kocák, Tomáš, et al.
Published: (2026)
Self-Rewarding PPO: Aligning Large Language Models with Demonstrations Only
by: Zhang, Qingru, et al.
Published: (2025)
Turn-Based Structural Triggers: Prompt-Free Backdoors in Multi-Turn LLMs
by: Lu, Yiyang, et al.
Published: (2026)
Eliciting Behaviors in Multi-Turn Conversations
by: Huang, Jing, et al.
Published: (2025)
Planning in entropy-regularized Markov decision processes and games
by: Grill, Jean-Bastien, et al.
Published: (2026)
Near-Minimax-Optimal Distributional Reinforcement Learning with a Generative Model
by: Rowland, Mark, et al.
Published: (2024)
Building Math Agents with Multi-Turn Iterative Preference Learning
by: Xiong, Wei, et al.
Published: (2024)
Aligning Frozen LLMs by Reinforcement Learning: An Iterative Reweight-then-Optimize Approach
by: Zhang, Xinnan, et al.
Published: (2025)
Sampling Complexity of TD and PPO in RKHS
by: Zou, Lu, et al.
Published: (2025)
Fix Initial Codes and Iteratively Refine Textual Directions Toward Safe Multi-Turn Code Correction
by: Tanaka, Yuto, et al.
Published: (2026)
Multi-Turn Reasoning LLMs for Task Offloading in Mobile Edge Computing
by: Yang, Ning, et al.
Published: (2026)
Asymmetric REINFORCE for off-Policy Reinforcement Learning: Balancing positive and negative rewards
by: Arnal, Charles, et al.
Published: (2025)
Temporal Difference Flows
by: Farebrother, Jesse, et al.
Published: (2025)
Asking Forever: Universal Activations Behind Turn Amplification in Conversational LLMs
by: Coalson, Zachary, et al.
Published: (2026)
VinePPO: Refining Credit Assignment in RL Training of LLMs
by: Kazemnejad, Amirhossein, et al.
Published: (2024)
Directional-Clamp PPO
by: Karpel, Gilad, et al.
Published: (2025)
Soft Policy Optimization: Online Off-Policy RL for Sequence Models
by: Cohen, Taco, et al.
Published: (2025)
How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation
by: Feldman, Shai, et al.
Published: (2026)
How Do LLMs Persuade? Linear Probes Can Uncover Persuasion Dynamics in Multi-Turn Conversations
by: Jaipersaud, Brandon, et al.
Published: (2025)