| Main Authors: | Min, Zijun, Liu, Bingshuai, Wang, Ante, Zhang, Long, Zeng, Anxiang, Zhang, Haibo, Su, Jinsong |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2601.05607 |
Similar Items
SPEC-RL: Accelerating On-Policy Reinforcement Learning with Speculative Rollouts
by: Liu, Bingshuai, et al.
Published: (2025)
HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment
by: Liu, Zhanyu, et al.
Published: (2026)
ESPO: Entropy Importance Sampling Policy Optimization
by: Sheng, Yuepeng, et al.
Published: (2025)
Optimal Transport-Based Token Weighting scheme for Enhanced Preference Optimization
by: Li, Meng, et al.
Published: (2025)
PlexRL: Cluster-Level Orchestration of Serviceized LLM Execution for RLVR
by: Zhang, Yiqi, et al.
Published: (2026)
Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity
by: Lochab, Anamika, et al.
Published: (2026)
CLIPO: Contrastive Learning in Policy Optimization Generalizes RLVR
by: Cui, Sijia, et al.
Published: (2026)
Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning
by: Huang, Zhuoxu, et al.
Published: (2026)
Retrieval-augmented Multi-modal Chain-of-Thoughts Reasoning for Large Language Models
by: Liu, Bingshuai, et al.
Published: (2023)
Data-Efficient RLVR via Off-Policy Influence Guidance
by: Zhu, Erle, et al.
Published: (2025)
Flexible Entropy Control in RLVR with a Gradient-Preserving Perspective
by: Chen, Kun, et al.
Published: (2026)
Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization
by: Xu, Huimin, et al.
Published: (2026)
Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex
by: Qu, Yun, et al.
Published: (2026)
On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR
by: Ye, Hao, et al.
Published: (2026)
Coverage-Guaranteed Speech Emotion Recognition via Calibrated Uncertainty-Adaptive Prediction Sets
by: Jia, Zijun, et al.
Published: (2025)
RLVR without Ineffective Samples: Group Prioritized Off-Policy Optimization for LLM Reasoning
by: Mao, Yixiu, et al.
Published: (2026)
Exploring Optimal Transport-Based Multi-Grained Alignments for Text-Molecule Retrieval
by: Min, Zijun, et al.
Published: (2024)
Token-level Proximal Policy Optimization for Query Generation
by: Ouyang, Yichen, et al.
Published: (2024)
Where Hindsight Credit Can Reside: A Signed-Capacity View of Token Updates in RLVR
by: He, Yuhang, et al.
Published: (2026)
When to Stop Reusing: Dynamic Gradient Gating for Sample-Efficient RLVR
by: Miao, Yuchun, et al.
Published: (2026)
Not only where, But when: Temporal Scheduling for RLVR
by: Zhang, Jinghao, et al.
Published: (2026)
Linear Dynamics in the RLVR Training of Large Language Models
by: Wang, Tianle, et al.
Published: (2026)
Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR
by: Liu, Fanfan, et al.
Published: (2026)
CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization
by: Heakl, Ahmed, et al.
Published: (2026)
Route Experts by Sequence, not by Token
by: Wen, Tiansheng, et al.
Published: (2025)
IAPO: Information-Aware Policy Optimization for Token-Efficient Reasoning
by: He, Yinhan, et al.
Published: (2026)
PubSwap: Public-Data Off-Policy Coordination for Federated RLVR
by: Nayak, Anupam, et al.
Published: (2026)
The Path Not Taken: RLVR Provably Learns Off the Principals
by: Zhu, Hanqing, et al.
Published: (2025)
Continuous Optimization for Feature Selection with Permutation-Invariant Embedding and Policy-Guided Search
by: Liu, Rui, et al.
Published: (2025)
Self-Distilled RLVR
by: Yang, Chenxu, et al.
Published: (2026)
Group Sequence Policy Optimization
by: Zheng, Chujie, et al.
Published: (2025)
LiteSearch: Efficacious Tree Search for LLM
by: Wang, Ante, et al.
Published: (2024)
Design Conditions for Intra-Group Learning of Sequence-Level Rewards: Token Gradient Cancellation
by: Ding, Fei, et al.
Published: (2026)
Feature-Aware One-Shot Federated Learning via Hierarchical Token Sequences
by: Liu, Shudong, et al.
Published: (2026)
Soft Policy Optimization: Online Off-Policy RL for Sequence Models
by: Cohen, Taco, et al.
Published: (2025)
Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement
by: Wen, Muning, et al.
Published: (2024)
Asymmetric Advantage Modulation Calibrates Entropy Dynamics in RLVR
by: Gu, Hengrui, et al.
Published: (2026)
RLVR-World: Training World Models with Reinforcement Learning
by: Wu, Jialong, et al.
Published: (2025)
LLM-Based Scientific Equation Discovery via Physics-Informed Token-Regularized Policy Optimization
by: Wang, Boxiao, et al.
Published: (2026)
Quantile Advantage Estimation: Stabilizing RLVR for LLM Reasoning
by: Wu, Junkang, et al.
Published: (2025)