:: Library Catalog

Bibliographic Details
Main Authors:	Min, Zijun; Liu, Bingshuai; Wang, Ante; Zhang, Long; Zeng, Anxiang; Zhang, Haibo; Su, Jinsong
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2601.05607

Similar Items

SPEC-RL: Accelerating On-Policy Reinforcement Learning with Speculative Rollouts
by: Liu, Bingshuai, et al.
Published: (2025)

HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment
by: Liu, Zhanyu, et al.
Published: (2026)

ESPO: Entropy Importance Sampling Policy Optimization
by: Sheng, Yuepeng, et al.
Published: (2025)

Optimal Transport-Based Token Weighting scheme for Enhanced Preference Optimization
by: Li, Meng, et al.
Published: (2025)

PlexRL: Cluster-Level Orchestration of Serviceized LLM Execution for RLVR
by: Zhang, Yiqi, et al.
Published: (2026)

Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity
by: Lochab, Anamika, et al.
Published: (2026)

CLIPO: Contrastive Learning in Policy Optimization Generalizes RLVR
by: Cui, Sijia, et al.
Published: (2026)

Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning
by: Huang, Zhuoxu, et al.
Published: (2026)

Retrieval-augmented Multi-modal Chain-of-Thoughts Reasoning for Large Language Models
by: Liu, Bingshuai, et al.
Published: (2023)

Data-Efficient RLVR via Off-Policy Influence Guidance
by: Zhu, Erle, et al.
Published: (2025)

Flexible Entropy Control in RLVR with a Gradient-Preserving Perspective
by: Chen, Kun, et al.
Published: (2026)

Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization
by: Xu, Huimin, et al.
Published: (2026)

Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex
by: Qu, Yun, et al.
Published: (2026)

On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR
by: Ye, Hao, et al.
Published: (2026)

Coverage-Guaranteed Speech Emotion Recognition via Calibrated Uncertainty-Adaptive Prediction Sets
by: Jia, Zijun, et al.
Published: (2025)

RLVR without Ineffective Samples: Group Prioritized Off-Policy Optimization for LLM Reasoning
by: Mao, Yixiu, et al.
Published: (2026)

Exploring Optimal Transport-Based Multi-Grained Alignments for Text-Molecule Retrieval
by: Min, Zijun, et al.
Published: (2024)

Token-level Proximal Policy Optimization for Query Generation
by: Ouyang, Yichen, et al.
Published: (2024)

Where Hindsight Credit Can Reside: A Signed-Capacity View of Token Updates in RLVR
by: He, Yuhang, et al.
Published: (2026)

When to Stop Reusing: Dynamic Gradient Gating for Sample-Efficient RLVR
by: Miao, Yuchun, et al.
Published: (2026)

Not only where, But when: Temporal Scheduling for RLVR
by: Zhang, Jinghao, et al.
Published: (2026)

Linear Dynamics in the RLVR Training of Large Language Models
by: Wang, Tianle, et al.
Published: (2026)

Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR
by: Liu, Fanfan, et al.
Published: (2026)

CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization
by: Heakl, Ahmed, et al.
Published: (2026)

Route Experts by Sequence, not by Token
by: Wen, Tiansheng, et al.
Published: (2025)

IAPO: Information-Aware Policy Optimization for Token-Efficient Reasoning
by: He, Yinhan, et al.
Published: (2026)

PubSwap: Public-Data Off-Policy Coordination for Federated RLVR
by: Nayak, Anupam, et al.
Published: (2026)

The Path Not Taken: RLVR Provably Learns Off the Principals
by: Zhu, Hanqing, et al.
Published: (2025)

Continuous Optimization for Feature Selection with Permutation-Invariant Embedding and Policy-Guided Search
by: Liu, Rui, et al.
Published: (2025)

Self-Distilled RLVR
by: Yang, Chenxu, et al.
Published: (2026)

Group Sequence Policy Optimization
by: Zheng, Chujie, et al.
Published: (2025)

LiteSearch: Efficacious Tree Search for LLM
by: Wang, Ante, et al.
Published: (2024)

Design Conditions for Intra-Group Learning of Sequence-Level Rewards: Token Gradient Cancellation
by: Ding, Fei, et al.
Published: (2026)

Feature-Aware One-Shot Federated Learning via Hierarchical Token Sequences
by: Liu, Shudong, et al.
Published: (2026)

Soft Policy Optimization: Online Off-Policy RL for Sequence Models
by: Cohen, Taco, et al.
Published: (2025)

Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement
by: Wen, Muning, et al.
Published: (2024)

Asymmetric Advantage Modulation Calibrates Entropy Dynamics in RLVR
by: Gu, Hengrui, et al.
Published: (2026)

RLVR-World: Training World Models with Reinforcement Learning
by: Wu, Jialong, et al.
Published: (2025)

LLM-Based Scientific Equation Discovery via Physics-Informed Token-Regularized Policy Optimization
by: Wang, Boxiao, et al.
Published: (2026)

Quantile Advantage Estimation: Stabilizing RLVR for LLM Reasoning
by: Wu, Junkang, et al.
Published: (2025)