| Main Authors: | Yang, Zhe, Wang, Yudong, Li, Rang, Sui, Zhifang |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2602.22765 |
Similar Items
Enhancing Reliability across Short and Long-Form QA via Reinforcement Learning
by: Wang, Yudong, et al.
Published: (2025)
Exploring Activation Patterns of Parameters in Language Models
by: Wang, Yudong, et al.
Published: (2024)
A Probabilistic Inference Scaling Theory for LLM Self-Correction
by: Yang, Zhe, et al.
Published: (2025)
Confidence v.s. Critique: A Decomposition of Self-Correction Capability for LLMs
by: Yang, Zhe, et al.
Published: (2024)
Not All Demonstration Examples are Equally Beneficial: Reweighting Demonstration Examples for In-Context Learning
by: Yang, Zhe, et al.
Published: (2023)
Accelerating RL Post-Training Rollouts via System-Integrated Speculative Decoding
by: Iso, Hayate, et al.
Published: (2026)
Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers
by: Ma, Wenhan, et al.
Published: (2025)
Plug-and-Play Training Framework for Preference Optimization
by: Ma, Jingyuan, et al.
Published: (2024)
From Mathematical Reasoning to Code: Generalization of Process Reward Models in Test-Time Scaling
by: Chen, Zhengyu, et al.
Published: (2025)
CoLT: Reasoning with Chain of Latent Tool Calls
by: Zhu, Fangwei, et al.
Published: (2026)
SPEC-RL: Accelerating On-Policy Reinforcement Learning with Speculative Rollouts
by: Liu, Bingshuai, et al.
Published: (2025)
Reducing Hallucinations in Entity Abstract Summarization with Facts-Template Decomposition
by: Zhu, Fangwei, et al.
Published: (2024)
RICo: Refined In-Context Contribution for Automatic Instruction-Tuning Data Selection
by: Yang, Yixin, et al.
Published: (2025)
Jet-RL: Enabling On-Policy FP8 Reinforcement Learning with Unified Training and Rollout Precision Flow
by: Xi, Haocheng, et al.
Published: (2026)
Reinforcement Pre-Training
by: Dong, Qingxiu, et al.
Published: (2025)
Chain-of-Thought Tokens are Computer Program Variables
by: Zhu, Fangwei, et al.
Published: (2025)
Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones?
by: Yang, Zhe, et al.
Published: (2024)
Towards Harmonized Uncertainty Estimation for Large Language Models
by: Li, Rui, et al.
Published: (2025)
Language Models Encode the Value of Numbers Linearly
by: Zhu, Fangwei, et al.
Published: (2024)
Prune as You Generate: Online Rollout Pruning for Faster and Better RLVR
by: Xu, Haobo, et al.
Published: (2026)
Self-Boosting Large Language Models with Synthetic Preference Data
by: Dong, Qingxiu, et al.
Published: (2024)
FSM: A Finite State Machine Based Zero-Shot Prompting Paradigm for Multi-Hop Question Answering
by: Wang, Xiaochen, et al.
Published: (2024)
Stable and Efficient Single-Rollout RL for Multimodal Reasoning
by: Liu, Rui, et al.
Published: (2025)
Can Large Multimodal Models Uncover Deep Semantics Behind Images?
by: Yang, Yixin, et al.
Published: (2024)
Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding
by: Xia, Heming, et al.
Published: (2024)
Sparse-RL: Breaking the Memory Wall in LLM Reinforcement Learning via Stable Sparse Rollouts
by: Luo, Sijia, et al.
Published: (2026)
HINT: Helping Ineffective Rollouts Navigate Towards Effectiveness
by: Wang, Xinyi, et al.
Published: (2025)
SG-FSM: A Self-Guiding Zero-Shot Prompting Paradigm for Multi-Hop Question Answering Based on Finite State Machine
by: Wang, Xiaochen, et al.
Published: (2024)
HistLens: Mapping Idea Change across Concepts and Corpora
by: Jing, Yi, et al.
Published: (2026)
TSR: Trajectory-Search Rollouts for Multi-Turn RL of LLM Agents
by: Djuhera, Aladin, et al.
Published: (2026)
Be a Multitude to Itself: A Prompt Evolution Framework for Red Teaming
by: Li, Rui, et al.
Published: (2025)
PeriodicLoRA: Breaking the Low-Rank Bottleneck in LoRA Optimization
by: Meng, Xiangdi, et al.
Published: (2024)
AlpaGasus: Training A Better Alpaca with Fewer Data
by: Chen, Lichang, et al.
Published: (2023)
Sparse-BitNet: 1.58-bit LLMs are Naturally Friendly to Semi-Structured Sparsity
by: Zhang, Di, et al.
Published: (2026)
Discovery and Reinforcement of Tool-Integrated Reasoning Chains via Rollout Trees
by: Li, Kun, et al.
Published: (2026)
$V_{0.5}$: Generalist Value Model as a Prior for Sparse RL Rollouts
by: Zhang, Yi-Kai, et al.
Published: (2026)
First SFT, Second RL, Third UPT: Continual Improving Multi-Modal LLM Reasoning via Unsupervised Post-Training
by: Wei, Lai, et al.
Published: (2025)
Chip-Tuning: Classify Before Language Models Say
by: Zhu, Fangwei, et al.
Published: (2024)
MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs
by: Qian, Yusu, et al.
Published: (2024)
Towards Stable and Effective Reinforcement Learning for Mixture-of-Experts
by: Zhang, Di, et al.
Published: (2025)