| Main Authors: | Nan, Tianlong, Li, Xiaopeng, Kroer, Christian, Lin, Tianyi |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2606.01382 |
Similar Items
Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning
by: Zhang, Yuheng, et al.
Published: (2024)
Exploration vs Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward
by: Chen, Peter, et al.
Published: (2025)
Stepwise Guided Policy Optimization: Coloring your Incorrect Reasoning in GRPO
by: Chen, Peter, et al.
Published: (2025)
Reward-free Alignment for Conflicting Objectives
by: Chen, Peter, et al.
Published: (2026)
Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences
by: Rosset, Corby, et al.
Published: (2024)
Sample Efficient Preference Alignment in LLMs via Active Exploration
by: Mehta, Viraj, et al.
Published: (2023)
No-Regret Learning Under Adversarial Resource Constraints: A Spending Plan Is All You Need!
by: Stradi, Francesco Emanuele, et al.
Published: (2025)
LEAD: Iterative Data Selection for Efficient LLM Instruction Tuning
by: Lin, Xiaotian, et al.
Published: (2025)
ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment
by: Lin, Xiaoqiang, et al.
Published: (2025)
Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint
by: Xiong, Wei, et al.
Published: (2023)
Spurious Correlation Learning in Preference Optimization: Mechanisms, Consequences, and Mitigation via Tie Training
by: Moya, Christian, et al.
Published: (2026)
ComPO: Preference Alignment via Comparison Oracles
by: Chen, Peter, et al.
Published: (2025)
Efficient Preference-Based Reinforcement Learning: Randomized Exploration Meets Experimental Design
by: Schlaginhaufen, Andreas, et al.
Published: (2025)
MDPO: Multi-Granularity Direct Preference Optimization for Mathematical Reasoning
by: Lin, Yunze
Published: (2025)
Refining Alignment Framework for Diffusion Models with Intermediate-Step Preference Ranking
by: Ren, Jie, et al.
Published: (2025)
Iterative Label Refinement Matters More than Preference Optimization under Weak Supervision
by: Ye, Yaowen, et al.
Published: (2025)
Preferred-Action-Optimized Diffusion Policies for Offline Reinforcement Learning
by: Zhang, Tianle, et al.
Published: (2024)
Preference Guided Iterated Pareto Referent Optimisation for Accessible Route Planning
by: Speziali, Paolo, et al.
Published: (2026)
Iterative Refinement Neural Operators are Learned Fixed-Point Solvers: A Principled Approach to Spectral Bias Mitigation
by: Liu, Xiaotian, et al.
Published: (2026)
Automated Skill Discovery for Language Agents through Exploration and Iterative Feedback
by: Yang, Yongjin, et al.
Published: (2025)
Active Preference Optimization for Sample Efficient RLHF
by: Das, Nirjhar, et al.
Published: (2024)
PB$^2$: Preference Space Exploration via Population-Based Methods in Preference-Based Reinforcement Learning
by: Driss, Brahim, et al.
Published: (2025)
FraPPE: Fast and Efficient Preference-based Pure Exploration
by: Das, Udvas, et al.
Published: (2025)
On the Role of Preference Variance in Preference Optimization
by: Guo, Jiacheng, et al.
Published: (2025)
Thinking Preference Optimization
by: Yang, Wang, et al.
Published: (2025)
Aligning CodeLLMs with Direct Preference Optimization
by: Miao, Yibo, et al.
Published: (2024)
Risk-aware Direct Preference Optimization under Nested Risk Measure
by: Zhang, Lijun, et al.
Published: (2025)
Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning
by: Xie, Yuxi, et al.
Published: (2024)
Preference as Reward, Maximum Preference Optimization with Importance Sampling
by: Jiang, Zaifan, et al.
Published: (2023)
Sharpness-Aware Minimization in Logit Space Efficiently Enhances Direct Preference Optimization
by: Luo, Haocheng, et al.
Published: (2026)
Efficient Exploration at Scale
by: Asghari, Seyed Mohammad, et al.
Published: (2026)
Confidence-Controlled Exploration: Efficient Sparse-Reward Policy Learning for Robot Navigation
by: Patel, Bhrij, et al.
Published: (2023)
Adversarial Preference Optimization: Enhancing Your Alignment via RM-LLM Game
by: Cheng, Pengyu, et al.
Published: (2023)
OPRIDE: Offline Preference-based Reinforcement Learning via In-Dataset Exploration
by: Yang, Yiqin, et al.
Published: (2026)
Nash CoT: Multi-Path Inference with Preference Equilibrium
by: Zhang, Ziqi, et al.
Published: (2024)
Promote, Suppress, Iterate: How Language Models Answer One-to-Many Factual Queries
by: Yan, Tianyi Lorena, et al.
Published: (2025)
Neural Dueling Bandits: Preference-Based Optimization with Human Feedback
by: Verma, Arun, et al.
Published: (2024)
The Quest for Efficient Reasoning: A Data-Centric Benchmark to CoT Distillation
by: Zhang, Ruichen, et al.
Published: (2025)
Provably Efficient Exploration in Inverse Constrained Reinforcement Learning
by: Yue, Bo, et al.
Published: (2024)
Graph Unlearning Meets Influence-aware Negative Preference Optimization
by: Chen, Qiang, et al.
Published: (2025)