| Main Author: | Jia, Chen |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2511.12867 |
Similar Items
Meta-Aligner: Bidirectional Preference-Policy Optimization for Multi-Objective LLMs Alignment
by: Xu, Wenzhe, et al.
Published: (2026)
Flattening Hierarchies with Policy Bootstrapping
by: Zhou, John L., et al.
Published: (2025)
Mitigating Preference Hacking in Policy Optimization with Pessimism
by: Gupta, Dhawal, et al.
Published: (2025)
Aligning CodeLLMs with Direct Preference Optimization
by: Miao, Yibo, et al.
Published: (2024)
Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs
by: Lai, Xin, et al.
Published: (2024)
Preferred-Action-Optimized Diffusion Policies for Offline Reinforcement Learning
by: Zhang, Tianle, et al.
Published: (2024)
On Symmetric Losses for Robust Policy Optimization with Noisy Preferences
by: Nishimori, Soichiro, et al.
Published: (2025)
Reflective Preference Optimization (RPO): Enhancing On-Policy Alignment via Hint-Guided Reflection
by: Zhao, Zihui, et al.
Published: (2025)
Metis: Learning to Jailbreak LLMs via Self-Evolving Metacognitive Policy Optimization
by: Zhou, Huilin, et al.
Published: (2026)
Pref-GUIDE: Continual Policy Learning from Real-Time Human Feedback via Preference-Based Learning
by: Ji, Zhengran, et al.
Published: (2025)
h1: Bootstrapping LLMs to Reason over Longer Horizons via Reinforcement Learning
by: Motwani, Sumeet Ramesh, et al.
Published: (2025)
Adversarial Policy Optimization for Offline Preference-based Reinforcement Learning
by: Kang, Hyungkyu, et al.
Published: (2025)
HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs
by: Kachroo, Darsh, et al.
Published: (2026)
Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning
by: Zhang, Yuheng, et al.
Published: (2024)
LANPO: Bootstrapping Language and Numerical Feedback for Reinforcement Learning in LLMs
by: Li, Ang, et al.
Published: (2025)
Calibration-Aware Policy Optimization for Reasoning LLMs
by: Wang, Ziqi, et al.
Published: (2026)
CuDIP: Enhancing Theorem Proving in LLMs via Curriculum Learning-based Direct Preference Optimization
by: Shi, Shuming, et al.
Published: (2025)
Intelligently Weighting Multiple Reference Models for Direct Preference Optimization of LLMs
by: Wu, Skyler, et al.
Published: (2025)
Policy-labeled Preference Learning: Is Preference Enough for RLHF?
by: Cho, Taehyun, et al.
Published: (2025)
Aligning Diffusion Language Models via Unpaired Preference Optimization
by: Jindal, Vaibhav, et al.
Published: (2025)
PolicyEvolve: Evolving Programmatic Policies by LLMs for multi-player games via Population-Based Training
by: Lv, Mingrui, et al.
Published: (2025)
One-Way Policy Optimization for Self-Evolving LLMs
by: Yang, Shuo, et al.
Published: (2026)
ICPL: Few-shot In-context Preference Learning via LLMs
by: Yu, Chao, et al.
Published: (2024)
Sample Efficient Preference Alignment in LLMs via Active Exploration
by: Mehta, Viraj, et al.
Published: (2023)
$ξ$-DPO: Direct Preference Optimization via Ratio Reward Margin
by: Fan, Zhengyuan, et al.
Published: (2026)
Twin-Boot: Uncertainty-Aware Optimization via Online Two-Sample Bootstrapping
by: Brito, Carlos Stein
Published: (2025)
Fine-tuning Behavioral Cloning Policies with Preference-Based Reinforcement Learning
by: Macuglia, Maël, et al.
Published: (2025)
COPR: Continual Human Preference Learning via Optimal Policy Regularization
by: Zhang, Han, et al.
Published: (2024)
BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping
by: Xi, Zhiheng, et al.
Published: (2025)
Preference Conditioned Multi-Objective Reinforcement Learning: Decomposed, Diversity-Driven Policy Optimization
by: Ambadkar, Tanmay, et al.
Published: (2026)
When are LLMs Sufficient Policy Optimizers for Sequential RL Tasks?
by: Hatgis-Kessell, Stephane, et al.
Published: (2026)
Personalized Group Relative Policy Optimization for Heterogenous Preference Alignment
by: Wang, Jialu, et al.
Published: (2026)
Offline Model-Based Optimization via Policy-Guided Gradient Search
by: Chemingui, Yassine, et al.
Published: (2024)
MPPO: Multi Pair-wise Preference Optimization for LLMs with Arbitrary Negative Samples
by: Xie, Shuo, et al.
Published: (2024)
Orthogonal Finetuning for Direct Preference Optimization
by: Yang, Chenxu, et al.
Published: (2024)
LLM-Based Scientific Equation Discovery via Physics-Informed Token-Regularized Policy Optimization
by: Wang, Boxiao, et al.
Published: (2026)
Thinking Preference Optimization
by: Yang, Wang, et al.
Published: (2025)
LOGICPO: Efficient Translation of NL-based Logical Problems to FOL using LLMs and Preference Optimization
by: Viswanadha, Koushik, et al.
Published: (2025)
PB$^2$: Preference Space Exploration via Population-Based Methods in Preference-Based Reinforcement Learning
by: Driss, Brahim, et al.
Published: (2025)
Neural Dueling Bandits: Preference-Based Optimization with Human Feedback
by: Verma, Arun, et al.
Published: (2024)