| Main Authors: | Qiu, Zhongxi, Zhang, Zhang, Hu, Yan, Li, Heng, Liu, Jiang |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2504.13950 |
Similar Items
HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment
by: Liu, Zhanyu, et al.
Published: (2026)
VL Norm: Rethink Loss Aggregation in RLVR
by: He, Zhiyuan, et al.
Published: (2025)
Choosing How to Remember: Adaptive Memory Structures for LLM Agents
by: Lu, Mingfei, et al.
Published: (2026)
Part II: ROLL Flash -- Accelerating RLVR and Agentic Training with Asynchrony
by: Lu, Han, et al.
Published: (2025)
RL in the Wild: Characterizing RLVR Training in LLM Deployment
by: Zhou, Jiecheng, et al.
Published: (2025)
Spurious Rewards: Rethinking Training Signals in RLVR
by: Shao, Rulin, et al.
Published: (2025)
IRDS: Interpretable RLVR Data Selection via Verifier-Coupled Sparse Autoencoder Coverage
by: Li, Yuhan, et al.
Published: (2026)
RLVR-World: Training World Models with Reinforcement Learning
by: Wu, Jialong, et al.
Published: (2025)
On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR
by: Ye, Hao, et al.
Published: (2026)
RLPR: Extrapolating RLVR to General Domains without Verifiers
by: Yu, Tianyu, et al.
Published: (2025)
Training Data Selection with Gradient Orthogonality for Efficient Domain Adaptation
by: Zhang, Xiyang, et al.
Published: (2026)
Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains
by: Samineni, Soumya Rani, et al.
Published: (2025)
Monitorability as a Free Gift: How RLVR Spontaneously Aligns Reasoning
by: Xiong, Zidi, et al.
Published: (2026)
CLIPO: Contrastive Learning in Policy Optimization Generalizes RLVR
by: Cui, Sijia, et al.
Published: (2026)
When to Stop Reusing: Dynamic Gradient Gating for Sample-Efficient RLVR
by: Miao, Yuchun, et al.
Published: (2026)
GeoRA: Geometry-Aware Low-Rank Adaptation for RLVR
by: Zhang, Jiaying, et al.
Published: (2026)
Conformal Selective Acting: Anytime-Valid Risk Control for RLVR-Trained LLMs
by: Khosravi, Hamed, et al.
Published: (2026)
Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training
by: Zhang, Mozhi, et al.
Published: (2025)
Quantile Advantage Estimation: Stabilizing RLVR for LLM Reasoning
by: Wu, Junkang, et al.
Published: (2025)
Beyond Uniform Credit Assignment: Selective Eligibility Traces for RLVR
by: Mou, Chaoli, et al.
Published: (2026)
Flexible Entropy Control in RLVR with a Gradient-Preserving Perspective
by: Chen, Kun, et al.
Published: (2026)
The Path Not Taken: RLVR Provably Learns Off the Principals
by: Zhu, Hanqing, et al.
Published: (2025)
R&B: Domain Regrouping and Data Mixture Balancing for Efficient Foundation Model Training
by: Ge, Albert, et al.
Published: (2025)
JudgeRLVR: Judge First, Generate Second for Efficient Reasoning
by: Duo, Jiangshan, et al.
Published: (2026)
Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes
by: Bauer, Justin, et al.
Published: (2026)
Rethinking Entropy Interventions in RLVR: An Entropy Change Perspective
by: Hao, Zhezheng, et al.
Published: (2025)
Learning from Streaming Data when Users Choose
by: Su, Jinyan, et al.
Published: (2024)
Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex
by: Qu, Yun, et al.
Published: (2026)
Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning
by: Huang, Zhuoxu, et al.
Published: (2026)
Quantifying Empirical Compute-Supervision Tradeoffs in RLVR
by: Mitsuhashi, Ryo, et al.
Published: (2026)
The Debate on RLVR Reasoning Capability Boundary: Shrinkage, Expansion, or Both? A Two-Stage Dynamic View
by: Yao, Xinhao, et al.
Published: (2025)
Deep Learning for Cross-Domain Data Fusion in Urban Computing: Taxonomy, Advances, and Outlook
by: Zou, Xingchen, et al.
Published: (2024)
Where Hindsight Credit Can Reside: A Signed-Capacity View of Token Updates in RLVR
by: He, Yuhang, et al.
Published: (2026)
Adapting Amidst Degradation: Cross Domain Li-ion Battery Health Estimation via Physics-Guided Test-Time Training
by: Feng, Yuyuan, et al.
Published: (2024)
On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation
by: Huang, Kexin, et al.
Published: (2026)
Exploiting Block Coordinate Descent for Cost-Effective LLM Model Training
by: Liu, Zeyu, et al.
Published: (2025)
Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent
by: Shu, Yao, et al.
Published: (2026)
Medical mT5: An Open-Source Multilingual Text-to-Text LLM for The Medical Domain
by: García-Ferrero, Iker, et al.
Published: (2024)
LLMs Gaming Verifiers: RLVR can Lead to Reward Hacking
by: Helff, Lukas, et al.
Published: (2026)
The Multiple Ticket Hypothesis: Random Sparse Subnetworks Suffice for RLVR
by: Adewuyi, Israel, et al.
Published: (2026)