| Main Authors: | Li, Chang, Tsu, Tshihao, Zhang, Yaren, Xue, Chao, He, Xiaodong |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2603.08239 |
Similar Items
Gradient-Adaptive Policy Optimization: Towards Multi-Objective Alignment of Large Language Models
by: Li, Chengao, et al.
Published: (2025)
RePO: Replay-Enhanced Policy Optimization
by: Li, Siheng, et al.
Published: (2025)
Group Sequence Policy Optimization
by: Zheng, Chujie, et al.
Published: (2025)
Optimizing Anytime Reasoning via Budget Relative Policy Optimization
by: Qi, Penghui, et al.
Published: (2025)
Soft Adaptive Policy Optimization
by: Gao, Chang, et al.
Published: (2025)
Dataset Reset Policy Optimization for RLHF
by: Chang, Jonathan D., et al.
Published: (2024)
COPO: Consistency-Aware Policy Optimization
by: Han, Jinghang, et al.
Published: (2025)
Scaf-GRPO: Scaffolded Group Relative Policy Optimization for Enhancing LLM Reasoning
by: Zhang, Xichen, et al.
Published: (2025)
Agentic Reinforced Policy Optimization
by: Dong, Guanting, et al.
Published: (2025)
Causally-Enhanced Reinforcement Policy Optimization
by: Wang, Xiangqi, et al.
Published: (2025)
ReDit: Reward Dithering for Improved LLM Policy Optimization
by: Wei, Chenxing, et al.
Published: (2025)
Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization
by: Su, Zhenpeng, et al.
Published: (2025)
Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF
by: Gao, Zhaolin, et al.
Published: (2024)
EBPO: Empirical Bayes Shrinkage for Stabilizing Group-Relative Policy Optimization
by: Han, Kevin, et al.
Published: (2026)
CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification
by: He, Junhui, et al.
Published: (2024)
ConfPO: Exploiting Policy Model Confidence for Critical Token Selection in Preference Optimization
by: Yoon, Hee Suk, et al.
Published: (2025)
Boundary-Guided Policy Optimization for Memory-efficient RL of Diffusion Large Language Models
by: Lin, Nianyi, et al.
Published: (2025)
Agentic Policy Optimization via Instruction-Policy Co-Evolution
by: Zhou, Han, et al.
Published: (2025)
Adaptive Social Learning via Mode Policy Optimization for Language Agents
by: Wang, Minzheng, et al.
Published: (2025)
AlignDistil: Token-Level Language Model Alignment as Adaptive Policy Distillation
by: Zhang, Songming, et al.
Published: (2025)
Stepwise Guided Policy Optimization: Coloring your Incorrect Reasoning in GRPO
by: Chen, Peter, et al.
Published: (2025)
BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping
by: Xi, Zhiheng, et al.
Published: (2025)
CLIPO: Contrastive Learning in Policy Optimization Generalizes RLVR
by: Cui, Sijia, et al.
Published: (2026)
DCPO: Dynamic Clipping Policy Optimization
by: Yang, Shihui, et al.
Published: (2025)
HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control
by: Yao, Xincheng, et al.
Published: (2026)
STEP: Success-Rate-Aware Trajectory-Efficient Policy Optimization
by: Chen, Yuhan, et al.
Published: (2025)
DefSent+: Improving sentence embeddings of language models by projecting definition sentences into a quasi-isotropic or isotropic vector space of unlimited dictionary entries
by: Liu, Xiaodong
Published: (2024)
Slow-Fast Policy Optimization: Reposition-Before-Update for LLM Reasoning
by: Wang, Ziyan, et al.
Published: (2025)
HAPO: Training Language Models to Reason Concisely via History-Aware Policy Optimization
by: Huang, Chengyu, et al.
Published: (2025)
BinaryPPO: Efficient Policy Optimization for Binary Classification
by: Pandey, Punya Syon, et al.
Published: (2026)
Stepwise Alignment for Constrained Language Model Policy Optimization
by: Wachi, Akifumi, et al.
Published: (2024)
GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
by: Liu, Shih-Yang, et al.
Published: (2026)
PORTool: Importance-Aware Policy Optimization with Rewarded Tree for Multi-Tool-Integrated Reasoning
by: Wu, Feijie, et al.
Published: (2025)
Agentic Entropy-Balanced Policy Optimization
by: Dong, Guanting, et al.
Published: (2025)
Personalized Group Relative Policy Optimization for Heterogenous Preference Alignment
by: Wang, Jialu, et al.
Published: (2026)
Beyond Uniform Credit: Causal Credit Assignment for Policy Optimization
by: Khandoga, Mykola, et al.
Published: (2026)
AlphaDPO: Adaptive Reward Margin for Direct Preference Optimization
by: Wu, Junkang, et al.
Published: (2024)
ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks
by: Yu, Xiaodong, et al.
Published: (2023)
Co-Evolution of Policy and Internal Reward for Language Agents
by: Wang, Xinyu, et al.
Published: (2026)
Towards Consistent Natural-Language Explanations via Explanation-Consistency Finetuning
by: Chen, Yanda, et al.
Published: (2024)