Library Catalog

Bibliographic Details
Main Authors:	Jiang, Daniel R., Bhandari, Jalaj, Yang, Yukai, Munos, Rémi, Lu, Tyler
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2511.21638

Similar Items

Aligned Multi Objective Optimization
by: Efroni, Yonathan, et al.
Published: (2025)

Turn-PPO: Turn-Level Advantage Estimation with PPO for Improved Multi-Turn RL in Agentic LLMs
by: Li, Junbo, et al.
Published: (2025)

Outcome-based Exploration for LLM Reasoning
by: Song, Yuda, et al.
Published: (2025)

On a few pitfalls in KL divergence gradient estimation for RL
by: Tang, Yunhao, et al.
Published: (2025)

Super-Exponential Regret for UCT, AlphaGo and Variants
by: Orseau, Laurent, et al.
Published: (2024)

Efficient RL Training for LLMs with Experience Replay
by: Arnal, Charles, et al.
Published: (2026)

Bandits attack function optimization
by: Preux, Philippe, et al.
Published: (2026)

Stochastic simultaneous optimistic optimization
by: Valko, Michal, et al.
Published: (2026)

RL-finetuning LLMs from on- and off-policy data with a single algorithm
by: Tang, Yunhao, et al.
Published: (2025)

Black-box optimization of noisy functions with unknown smoothness
by: Grill, Jean-Bastien, et al.
Published: (2026)

Blazing the trails before beating the path: Sample-efficient Monte-Carlo planning
by: Grill, Jean-Bastien, et al.
Published: (2026)

Beyond Verifiable Rewards: Scaling Reinforcement Learning for Language Models to Unverifiable Data
by: Tang, Yunhao, et al.
Published: (2025)

Optimizing Language Models for Inference Time Objectives using Reinforcement Learning
by: Tang, Yunhao, et al.
Published: (2025)

Spectral Thompson sampling
by: Kocak, Tomas, et al.
Published: (2026)

VA-learning as a more efficient alternative to Q-learning
by: Tang, Yunhao, et al.
Published: (2023)

Spectral bandits for smooth graph functions
by: Valko, Michal, et al.
Published: (2026)

Efficient learning by implicit exploration in bandit problems with side observations
by: Kocak, Tomas, et al.
Published: (2026)

Spectral bandits for smooth graph functions with applications in recommender systems
by: Kocák, Tomáš, et al.
Published: (2026)

Off-policy Distributional Q($λ$): Distributional RL without Importance Sampling
by: Tang, Yunhao, et al.
Published: (2024)

Enhancing PPO with Trajectory-Aware Hybrid Policies
by: Liu, Qisai, et al.
Published: (2025)

Mitigating Conversational Inertia in Multi-Turn Agents
by: Wan, Yang, et al.
Published: (2026)

Spectral bandits
by: Kocák, Tomáš, et al.
Published: (2026)

Self-Rewarding PPO: Aligning Large Language Models with Demonstrations Only
by: Zhang, Qingru, et al.
Published: (2025)

Turn-Based Structural Triggers: Prompt-Free Backdoors in Multi-Turn LLMs
by: Lu, Yiyang, et al.
Published: (2026)

Eliciting Behaviors in Multi-Turn Conversations
by: Huang, Jing, et al.
Published: (2025)

Planning in entropy-regularized Markov decision processes and games
by: Grill, Jean-Bastien, et al.
Published: (2026)

Near-Minimax-Optimal Distributional Reinforcement Learning with a Generative Model
by: Rowland, Mark, et al.
Published: (2024)

Building Math Agents with Multi-Turn Iterative Preference Learning
by: Xiong, Wei, et al.
Published: (2024)

Aligning Frozen LLMs by Reinforcement Learning: An Iterative Reweight-then-Optimize Approach
by: Zhang, Xinnan, et al.
Published: (2025)

Sampling Complexity of TD and PPO in RKHS
by: Zou, Lu, et al.
Published: (2025)

Fix Initial Codes and Iteratively Refine Textual Directions Toward Safe Multi-Turn Code Correction
by: Tanaka, Yuto, et al.
Published: (2026)

Multi-Turn Reasoning LLMs for Task Offloading in Mobile Edge Computing
by: Yang, Ning, et al.
Published: (2026)

Asymmetric REINFORCE for off-Policy Reinforcement Learning: Balancing positive and negative rewards
by: Arnal, Charles, et al.
Published: (2025)

Temporal Difference Flows
by: Farebrother, Jesse, et al.
Published: (2025)

Asking Forever: Universal Activations Behind Turn Amplification in Conversational LLMs
by: Coalson, Zachary, et al.
Published: (2026)

VinePPO: Refining Credit Assignment in RL Training of LLMs
by: Kazemnejad, Amirhossein, et al.
Published: (2024)

Directional-Clamp PPO
by: Karpel, Gilad, et al.
Published: (2025)

Soft Policy Optimization: Online Off-Policy RL for Sequence Models
by: Cohen, Taco, et al.
Published: (2025)

How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation
by: Feldman, Shai, et al.
Published: (2026)

How Do LLMs Persuade? Linear Probes Can Uncover Persuasion Dynamics in Multi-Turn Conversations
by: Jaipersaud, Brandon, et al.
Published: (2025)