Library Catalog

Bibliographic Details
Main Authors:	Qiu, Zhongxi; Zhang, Zhang; Hu, Yan; Li, Heng; Liu, Jiang
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2504.13950

Similar Items

HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment
by: Liu, Zhanyu, et al.
Published: (2026)

VL Norm: Rethink Loss Aggregation in RLVR
by: He, Zhiyuan, et al.
Published: (2025)

Choosing How to Remember: Adaptive Memory Structures for LLM Agents
by: Lu, Mingfei, et al.
Published: (2026)

Part II: ROLL Flash -- Accelerating RLVR and Agentic Training with Asynchrony
by: Lu, Han, et al.
Published: (2025)

RL in the Wild: Characterizing RLVR Training in LLM Deployment
by: Zhou, Jiecheng, et al.
Published: (2025)

Spurious Rewards: Rethinking Training Signals in RLVR
by: Shao, Rulin, et al.
Published: (2025)

IRDS: Interpretable RLVR Data Selection via Verifier-Coupled Sparse Autoencoder Coverage
by: Li, Yuhan, et al.
Published: (2026)

RLVR-World: Training World Models with Reinforcement Learning
by: Wu, Jialong, et al.
Published: (2025)

On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR
by: Ye, Hao, et al.
Published: (2026)

RLPR: Extrapolating RLVR to General Domains without Verifiers
by: Yu, Tianyu, et al.
Published: (2025)

Training Data Selection with Gradient Orthogonality for Efficient Domain Adaptation
by: Zhang, Xiyang, et al.
Published: (2026)

Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains
by: Samineni, Soumya Rani, et al.
Published: (2025)

Monitorability as a Free Gift: How RLVR Spontaneously Aligns Reasoning
by: Xiong, Zidi, et al.
Published: (2026)

CLIPO: Contrastive Learning in Policy Optimization Generalizes RLVR
by: Cui, Sijia, et al.
Published: (2026)

When to Stop Reusing: Dynamic Gradient Gating for Sample-Efficient RLVR
by: Miao, Yuchun, et al.
Published: (2026)

GeoRA: Geometry-Aware Low-Rank Adaptation for RLVR
by: Zhang, Jiaying, et al.
Published: (2026)

Conformal Selective Acting: Anytime-Valid Risk Control for RLVR-Trained LLMs
by: Khosravi, Hamed, et al.
Published: (2026)

Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training
by: Zhang, Mozhi, et al.
Published: (2025)

Quantile Advantage Estimation: Stabilizing RLVR for LLM Reasoning
by: Wu, Junkang, et al.
Published: (2025)

Beyond Uniform Credit Assignment: Selective Eligibility Traces for RLVR
by: Mou, Chaoli, et al.
Published: (2026)

Flexible Entropy Control in RLVR with a Gradient-Preserving Perspective
by: Chen, Kun, et al.
Published: (2026)

The Path Not Taken: RLVR Provably Learns Off the Principals
by: Zhu, Hanqing, et al.
Published: (2025)

R&B: Domain Regrouping and Data Mixture Balancing for Efficient Foundation Model Training
by: Ge, Albert, et al.
Published: (2025)

JudgeRLVR: Judge First, Generate Second for Efficient Reasoning
by: Duo, Jiangshan, et al.
Published: (2026)

Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes
by: Bauer, Justin, et al.
Published: (2026)

Rethinking Entropy Interventions in RLVR: An Entropy Change Perspective
by: Hao, Zhezheng, et al.
Published: (2025)

Learning from Streaming Data when Users Choose
by: Su, Jinyan, et al.
Published: (2024)

Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex
by: Qu, Yun, et al.
Published: (2026)

Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning
by: Huang, Zhuoxu, et al.
Published: (2026)

Quantifying Empirical Compute-Supervision Tradeoffs in RLVR
by: Mitsuhashi, Ryo, et al.
Published: (2026)

The Debate on RLVR Reasoning Capability Boundary: Shrinkage, Expansion, or Both? A Two-Stage Dynamic View
by: Yao, Xinhao, et al.
Published: (2025)

Deep Learning for Cross-Domain Data Fusion in Urban Computing: Taxonomy, Advances, and Outlook
by: Zou, Xingchen, et al.
Published: (2024)

Where Hindsight Credit Can Reside: A Signed-Capacity View of Token Updates in RLVR
by: He, Yuhang, et al.
Published: (2026)

Adapting Amidst Degradation: Cross Domain Li-ion Battery Health Estimation via Physics-Guided Test-Time Training
by: Feng, Yuyuan, et al.
Published: (2024)

On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation
by: Huang, Kexin, et al.
Published: (2026)

Exploiting Block Coordinate Descent for Cost-Effective LLM Model Training
by: Liu, Zeyu, et al.
Published: (2025)

Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent
by: Shu, Yao, et al.
Published: (2026)

Medical mT5: An Open-Source Multilingual Text-to-Text LLM for The Medical Domain
by: García-Ferrero, Iker, et al.
Published: (2024)

LLMs Gaming Verifiers: RLVR can Lead to Reward Hacking
by: Helff, Lukas, et al.
Published: (2026)

The Multiple Ticket Hypothesis: Random Sparse Subnetworks Suffice for RLVR
by: Adewuyi, Israel, et al.
Published: (2026)