| Main Authors: | Ding, Yuyang, Zhang, Chi, Li, Juntao, Lin, Haibin, Zhang, Min |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2510.22543 |
Similar Items
SCAN: Self-Denoising Monte Carlo Annotation for Robust Process Reward Learning
by: Ding, Yuyang, et al.
Published: (2025)
IAPO: Information-Aware Policy Optimization for Token-Efficient Reasoning
by: He, Yinhan, et al.
Published: (2026)
LongFlow: Efficient KV Cache Compression for Reasoning Models
by: Su, Yi, et al.
Published: (2026)
Towards DS-NER: Unveiling and Addressing Latent Noise in Distant Annotations
by: Ding, Yuyang, et al.
Published: (2025)
Reliability-Aware Adaptive Self-Consistency for Efficient Sampling in LLM Reasoning
by: Kim, Junseok, et al.
Published: (2026)
Large Reasoning Models Learn Better Alignment from Flawed Thinking
by: Peng, ShengYun, et al.
Published: (2025)
Calibration-Aware Policy Optimization for Reasoning LLMs
by: Wang, Ziqi, et al.
Published: (2026)
LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization
by: Zhao, Juntao, et al.
Published: (2024)
DRPO: Efficient Reasoning via Decoupled Reward Policy Optimization
by: Li, Gang, et al.
Published: (2025)
Consistency Models as a Rich and Efficient Policy Class for Reinforcement Learning
by: Ding, Zihan, et al.
Published: (2023)
Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference
by: Qiu, Quantong, et al.
Published: (2026)
Uncertainty-Aware Data-Based Method for Fast and Reliable Shape Optimization
by: Yang, Yunjia, et al.
Published: (2026)
Prune-OPD: Efficient and Reliable On-Policy Distillation for Long-Horizon Reasoning
by: Yang, Zhicheng, et al.
Published: (2026)
Provably Efficient Exploration in Policy Optimization
by: Cai, Qi, et al.
Published: (2019)
Diversity-Aware Policy Optimization for Large Language Model Reasoning
by: Yao, Jian, et al.
Published: (2025)
Optimizing Anytime Reasoning via Budget Relative Policy Optimization
by: Qi, Penghui, et al.
Published: (2025)
Structured Role-Aware Policy Optimization for Multimodal Reasoning
by: Jiang, Bingqing, et al.
Published: (2026)
WS-GRPO: Weakly-Supervised Group-Relative Policy Optimization for Rollout-Efficient Reasoning
by: Mundada, Gagan, et al.
Published: (2026)
STEP: Success-Rate-Aware Trajectory-Efficient Policy Optimization
by: Chen, Yuhan, et al.
Published: (2025)
Decentralized Diffusion Policy Learning for Enhanced Exploration in Cooperative Multi-agent Reinforcement Learning
by: Zhang, Yuyang, et al.
Published: (2026)
Textual Gradients are a Flawed Metaphor for Automatic Prompt Optimization
by: Melcer, Daniel, et al.
Published: (2025)
REA-RL: Reflection-Aware Online Reinforcement Learning for Efficient Reasoning
by: Deng, Hexuan, et al.
Published: (2025)
HAPO: Training Language Models to Reason Concisely via History-Aware Policy Optimization
by: Huang, Chengyu, et al.
Published: (2025)
ACT: Anti-Crosstalk Learning for Cross-Sectional Stock Ranking via Temporal Disentanglement and Structural Purification
by: Li, Juntao, et al.
Published: (2026)
Constrained Policy Optimization with Explicit Behavior Density for Offline Reinforcement Learning
by: Zhang, Jing, et al.
Published: (2023)
Biased or Flawed? Mitigating Stereotypes in Generative Language Models by Addressing Task-Specific Flaws
by: Jha, Akshita, et al.
Published: (2024)
Reference-guided Policy Optimization for Molecular Optimization via LLM Reasoning
by: Li, Xuan, et al.
Published: (2026)
When Are Teacher Tokens Reliable? Position-Weighted On-Policy Self-Distillation for Reasoning
by: Liu, Xiaogeng, et al.
Published: (2026)
LEPO: Latent Reasoning Policy Optimization for Large Language Models
by: Zhou, Yuyan, et al.
Published: (2026)
UCPO: Uncertainty-Aware Policy Optimization
by: Zeng, Xianzhou, et al.
Published: (2026)
Bridging SFT and RL: Dynamic Policy Optimization for Robust Reasoning
by: Zhu, Taojie, et al.
Published: (2026)
Balancing the Reasoning Load: Difficulty-Differentiated Policy Optimization with Length Redistribution for Efficient and Robust Reinforcement Learning
by: Xia, Yinan, et al.
Published: (2026)
BARREL: Boundary-Aware Reasoning for Factual and Reliable LRMs
by: Yang, Junxiao, et al.
Published: (2025)
Can Large Reasoning Models Improve Accuracy on Mathematical Tasks Using Flawed Thinking?
by: Amjith, Saraswathy, et al.
Published: (2025)
Capacity-Aware Mixture Law Enables Efficient LLM Data Optimization
by: Li, Jingwei, et al.
Published: (2026)
PORTool: Importance-Aware Policy Optimization with Rewarded Tree for Multi-Tool-Integrated Reasoning
by: Wu, Feijie, et al.
Published: (2025)
Unbiased Dynamic Pruning for Efficient Group-Based Policy Optimization
by: Zhu, Haodong, et al.
Published: (2026)
QSync: Quantization-Minimized Synchronous Distributed Training Across Hybrid Devices
by: Zhao, Juntao, et al.
Published: (2024)
COPO: Consistency-Aware Policy Optimization
by: Han, Jinghang, et al.
Published: (2025)
PSPO*: An Effective Process-supervised Policy Optimization for Reasoning Alignment
by: Li, Jiawei, et al.
Published: (2024)