| Main Authors: | Mahrooghi, Ilia, Lotfi, Aryo, Abbe, Emmanuel |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2602.14868 |
Similar Items
RL for Reasoning by Adaptively Revealing Rationales
by: Amani, Mohammad Hossein, et al.
Published: (2025)
How Far Can Transformers Reason? The Globality Barrier and Inductive Scratchpad
by: Abbe, Emmanuel, et al.
Published: (2024)
Generalization on the Unseen, Logic Reasoning and Degree Curriculum
by: Abbe, Emmanuel, et al.
Published: (2023)
Chain-of-Sketch: Enabling Global Visual Reasoning
by: Lotfi, Aryo, et al.
Published: (2024)
Can Prompt Difficulty be Online Predicted for Accelerating RL Finetuning of Reasoning Models?
by: Qu, Yun, et al.
Published: (2025)
$k$-server-bench: Automating Potential Discovery for the $k$-Server Conjecture
by: Brilliantov, Kirill, et al.
Published: (2026)
Omni-Thinker: Scaling Multi-Task RL in LLMs with Hybrid Reward and Task Scheduling
by: Li, Derek, et al.
Published: (2025)
FlowRL: Matching Reward Distributions for LLM Reasoning
by: Zhu, Xuekai, et al.
Published: (2025)
Towards Improving Reward Design in RL: A Reward Alignment Metric for RL Practitioners
by: Muslimani, Calarina, et al.
Published: (2025)
STO-RL: Offline RL under Sparse Rewards via LLM-Guided Subgoal Temporal Order
by: Gu, Chengyang, et al.
Published: (2026)
Fight Fire with Fire: Defending Against Malicious RL Fine-Tuning via Reward Neutralization
by: Cao, Wenjun
Published: (2025)
On Designing Effective RL Reward at Training Time for LLM Reasoning
by: Gao, Jiaxuan, et al.
Published: (2024)
Prompt-Tuned LLM-Augmented DRL for Dynamic O-RAN Network Slicing
by: Lotfi, Fatemeh, et al.
Published: (2025)
When LLM Reward Design Fails: Diagnostic-Driven Refinement for Sparse Structured RL
by: Wang, Youting, et al.
Published: (2026)
To Infinity and Beyond: Tool-Use Unlocks Length Generalization in State Space Models
by: Malach, Eran, et al.
Published: (2025)
Escaping the Verifier: Learning to Reason via Demonstrations
by: Cai, Locke, et al.
Published: (2025)
Exploring RL-based LLM Training for Formal Language Tasks with Programmed Rewards
by: Padula, Alexander G., et al.
Published: (2024)
RADAR: Reasoning-Ability and Difficulty-Aware Routing for Reasoning LLMs
by: Fernandez, Nigel, et al.
Published: (2025)
Task Specific Sharpness Aware O-RAN Resource Management using Multi Agent Reinforcement Learning
by: Lotfi, Fatemeh, et al.
Published: (2025)
Hard Prompts Made Interpretable: Sparse Entropy Regularization for Prompt Tuning with RL
by: Choi, Yunseon, et al.
Published: (2024)
Intrinsic Reward Policy Optimization for Sparse-Reward Environments
by: Cho, Minjae, et al.
Published: (2026)
Attention-Based Reward Shaping for Sparse and Delayed Rewards
by: Holmes, Ian, et al.
Published: (2025)
Shorter Thoughts, Same Answers: Difficulty-Scaled Segment-Wise RL for CoT Compression
by: Tian, Ye, et al.
Published: (2026)
Data Difficulty and the Generalization–Extrapolation Tradeoff in LLM Fine-Tuning
by: Liu, Siyuan, et al.
Published: (2026)
Optimizing Reasoning Efficiency through Prompt Difficulty Prediction
by: Zhao, Bo, et al.
Published: (2025)
Policy Agnostic RL: Offline RL and Online RL Fine-Tuning of Any Class and Backbone
by: Mark, Max Sobol, et al.
Published: (2024)
A Comprehensive Study of Supervised Machine Learning Models for Zero-Day Attack Detection: Analyzing Performance on Imbalanced Data
by: Lotfi, Zahra, et al.
Published: (2025)
CoMAT: Chain of Mathematically Annotated Thought Improves Mathematical Reasoning
by: Leang, Joshua Ong Jun, et al.
Published: (2024)
VADE: Variance-Aware Dynamic Sampling via Online Sample-Level Difficulty Estimation for Multimodal RL
by: Hu, Zengjie, et al.
Published: (2025)
Your Reward Function for RL is Your Best PRM for Search: Unifying RL and Search-Based TTS
by: Jin, Can, et al.
Published: (2025)
DAST: Difficulty-Adaptive Slow-Thinking for Large Reasoning Models
by: Shen, Yi, et al.
Published: (2025)
Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap
by: Qi, Xuan, et al.
Published: (2025)
Debate as Reward: A Multi-Agent Reward System for Scientific Ideation via RL Post-Training
by: Salimi, Moein, et al.
Published: (2026)
LEAD: Breaking the No-Recovery Bottleneck in Long-Horizon Reasoning
by: Pushkin, Denys, et al.
Published: (2026)
What Fundamental Structure in Reward Functions Enables Efficient Sparse-Reward Learning?
by: Shihab, Ibne Farabi, et al.
Published: (2025)
When can transformers reason with abstract symbols?
by: Boix-Adsera, Enric, et al.
Published: (2023)
Wireless Federated Multi-Task LLM Fine-Tuning via Sparse-and-Orthogonal LoRA
by: Yang, Nuocheng, et al.
Published: (2026)
Less is more? Rewards in RL for Cyber Defence
by: Bates, Elizabeth, et al.
Published: (2025)
ProgAgent: A Continual RL Agent with Progress-Aware Rewards
by: Tan, Jinzhou, et al.
Published: (2026)
Verifier-Free RL for LLMs via Intrinsic Gradient-Norm Reward
by: Wen, Xuexiang, et al.
Published: (2026)