| Main Authors: | Liu, Tong; Qian, Cheng; Cief, Matej; He, Yuan; Dan, Daniele; Aletras, Nikolaos; Kazai, Gabriella |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2606.00135 |
Similar Items
Bridging the Gap: From Ad-hoc to Proactive Search in Conversations
by: Meng, Chuan, et al.
Published: (2025)
Pessimistic Off-Policy Optimization for Learning to Rank
by: Cief, Matej, et al.
Published: (2022)
Learning Action Embeddings for Off-Policy Evaluation
by: Cief, Matej, et al.
Published: (2023)
Incorporating Attribution Importance for Improving Faithfulness Metrics
by: Zhao, Zhixue, et al.
Published: (2023)
Where does output diversity collapse in post-training?
by: Karouzos, Constantinos, et al.
Published: (2026)
An Empirical Study on Preference Tuning Generalization and Diversity Under Domain Shift
by: Karouzos, Constantinos, et al.
Published: (2026)
ToolRL: Reward is All Tool Learning Needs
by: Qian, Cheng, et al.
Published: (2025)
Tool Zero: Training Tool-Augmented LLMs via Pure RL from Scratch
by: Zeng, Yirong, et al.
Published: (2025)
On Designing Effective RL Reward at Training Time for LLM Reasoning
by: Gao, Jiaxuan, et al.
Published: (2024)
Cross-Validated Off-Policy Evaluation
by: Cief, Matej, et al.
Published: (2024)
Skill Reuse as Compression in Agentic RL
by: Xu, Zhikun, et al.
Published: (2026)
SLEA-RL: Step-Level Experience Augmented Reinforcement Learning for Multi-Turn Agentic Training
by: Wang, Prince Zizhuang, et al.
Published: (2026)
CuES: A Curiosity-driven and Environment-grounded Synthesis Framework for Agentic RL
by: Mai, Shinji, et al.
Published: (2025)
ReSkill: Reconciling Skill Creation with Policy Optimization in Agentic RL
by: He, Zelin, et al.
Published: (2026)
MobileRL: Online Agentic Reinforcement Learning for Mobile GUI Agents
by: Xu, Yifan, et al.
Published: (2025)
ReST-RL: Achieving Accurate Code Reasoning of LLMs with Optimized Self-Training and Decoding
by: Zhoubian, Sining, et al.
Published: (2025)
DiRL: An Efficient Post-Training Framework for Diffusion Language Models
by: Zhu, Ying, et al.
Published: (2025)
RollArt: Scaling Agentic RL Training via Disaggregated Infrastructure
by: Gao, Wei, et al.
Published: (2025)
GAC: Noise-Aware Adaptive Mixing for Hybrid SFT-RL Post-Training
by: Hu, Yuelin, et al.
Published: (2026)
In-the-Flow Agentic System Optimization for Effective Planning and Tool Use
by: Li, Zhuofeng, et al.
Published: (2025)
Reasoning Dynamics and the Limits of Monitoring Modality Reliance in Vision-Language Models
by: Villegas, Danae Sánchez, et al.
Published: (2026)
QaRL: Rollout-Aligned Quantization-Aware RL for Fast and Stable Training under Training--Inference Mismatch
by: Gu, Hao, et al.
Published: (2026)
ContextRL: Enhancing MLLM's Knowledge Discovery Efficiency with Context-Augmented RL
by: Lu, Xingyu, et al.
Published: (2026)
Android Coach: Improve Online Agentic Training Efficiency with Single State Multiple Actions
by: Gan, Guo, et al.
Published: (2026)
Supplement Generation Training for Enhancing Agentic Task Performance
by: Cho, Young Min, et al.
Published: (2026)
Topology-Aware Revival for Efficient Sparse Training
by: Jin, Meiling, et al.
Published: (2026)
Dynamic Vocabulary Pruning: Stable LLM-RL by Taming the Tail
by: Li, Yingru, et al.
Published: (2025)
Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training
by: Ye, Chenlu, et al.
Published: (2025)
An Empirical Study on the Effectiveness of Incorporating Offline RL As Online RL Subroutines
by: Su, Jianhai, et al.
Published: (2025)
Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models
by: Guo, Yiran, et al.
Published: (2025)
GAC: Stabilizing Asynchronous RL Training for LLMs via Gradient Alignment Control
by: Xu, Haofeng, et al.
Published: (2026)
What Limits Agentic Systems Efficiency?
by: Bian, Song, et al.
Published: (2025)
Agentic Critical Training
by: Liu, Weize, et al.
Published: (2026)
UserRL: Training Interactive User-Centric Agent via Reinforcement Learning
by: Qian, Cheng, et al.
Published: (2025)
Token-Efficient RL for LLM Reasoning
by: Lee, Alan, et al.
Published: (2025)
CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation
by: Dai, Weinan, et al.
Published: (2026)
DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing
by: Li, Conglong, et al.
Published: (2022)
The Optimal Token Baseline: Variance Reduction for Long-Horizon LLM-RL
by: Li, Yingru, et al.
Published: (2026)
SortedRL: Accelerating RL Training for LLMs through Online Length-Aware Scheduling
by: Zhang, Yiqi, et al.
Published: (2026)
MindSpeed RL: Distributed Dataflow for Scalable and Efficient RL Training on Ascend NPU Cluster
by: Feng, Laingjun, et al.
Published: (2025)