| Main Authors: | Li, Long-Fei, Qian, Yu-Yang, Zhao, Peng, Zhou, Zhi-Hua |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2502.07193 |
Similar Items
Provably Efficient Reinforcement Learning with Multinomial Logit Function Approximation
by: Li, Long-Fei, et al.
Published: (2024)
RLHF Workflow: From Reward Modeling to Online RLHF
by: Dong, Hanze, et al.
Published: (2024)
Heavy-Tailed Linear Bandits: Huber Regression with One-Pass Update
by: Wang, Jing, et al.
Published: (2025)
Greedy Sampling Is Provably Efficient for RLHF
by: Wu, Di, et al.
Published: (2025)
Provably Mitigating Corruption, Overoptimization, and Verbosity Simultaneously in Offline and Online RLHF/DPO Alignment
by: Chen, Ziyi, et al.
Published: (2025)
Near-Optimal Dynamic Regret for Adversarial Linear Mixture MDPs
by: Li, Long-Fei, et al.
Published: (2024)
Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit Feedback and Unknown Transition
by: Li, Long-Fei, et al.
Published: (2024)
Bias Fitting to Mitigate Length Bias of Reward Model in RLHF
by: Zhao, Kangwen, et al.
Published: (2025)
Efficient Methods for Non-stationary Online Learning
by: Zhao, Peng, et al.
Published: (2023)
Reward-Robust RLHF in LLMs
by: Yan, Yuzi, et al.
Published: (2024)
Optimal Design for Reward Modeling in RLHF
by: Scheid, Antoine, et al.
Published: (2024)
Universal Online Learning with Gradient Variations: A Multi-layer Online Ensemble Approach
by: Yan, Yu-Hu, et al.
Published: (2023)
A Simple, Optimal and Efficient Algorithm for Online Exp-Concave Optimization
by: Wang, Yi-Han, et al.
Published: (2025)
Policy Filtration for RLHF to Mitigate Noise in Reward Models
by: Zhang, Chuheng, et al.
Published: (2024)
Reward Shaping to Mitigate Reward Hacking in RLHF
by: Fu, Jiayi, et al.
Published: (2025)
How to Evaluate Reward Models for RLHF
by: Frick, Evan, et al.
Published: (2024)
Factored Causal Representation Learning for Robust Reward Modeling in RLHF
by: Yang, Yupei, et al.
Published: (2026)
Optimistic Online-to-Batch Conversions for Accelerated Convergence and Universality
by: Yan, Yu-Hu, et al.
Published: (2025)
Learning a Pessimistic Reward Model in RLHF
by: Xu, Yinglun, et al.
Published: (2025)
Mitigating Reward Over-Optimization in RLHF via Behavior-Supported Regularization
by: Dai, Juntao, et al.
Published: (2025)
Reward Model Overoptimisation in Iterated RLHF
by: Wolf, Lorenz, et al.
Published: (2025)
Can RLHF be More Efficient with Imperfect Reward Models? A Policy Coverage Perspective
by: Huang, Jiawei, et al.
Published: (2025)
TreeLoRA: Efficient Continual Learning via Layer-Wise LoRAs Guided by a Hierarchical Gradient-Similarity Tree
by: Qian, Yu-Yang, et al.
Published: (2025)
Information-Theoretic Reward Modeling for Stable RLHF: Detecting and Mitigating Reward Hacking
by: Miao, Yuchun, et al.
Published: (2025)
Adaptivity and Non-stationarity: Problem-dependent Dynamic Regret for Online Convex Optimization
by: Zhao, Peng, et al.
Published: (2021)
Adaptivity and Universality: Problem-dependent Universal Regret for Online Convex Optimization
by: Zhao, Peng, et al.
Published: (2025)
One-Step Bellman Alignment Enables Provably Efficient Transfer in Online RL
by: Chen, Elynn, et al.
Published: (2026)
Reward Generalization in RLHF: A Topological Perspective
by: Qiu, Tianyi, et al.
Published: (2024)
Quantile Regression for Distributional Reward Models in RLHF
by: Dorka, Nicolai
Published: (2024)
Gradient-Variation Online Learning under Generalized Smoothness
by: Xie, Yan-Feng, et al.
Published: (2024)
BadReward: Clean-Label Poisoning of Reward Models in Text-to-Image RLHF
by: Duan, Kaiwen, et al.
Published: (2025)
Accelerating RLHF Training with Reward Variance Increase
by: Yang, Zonglin, et al.
Published: (2025)
It Takes Two: On the Seamlessness between Reward and Policy Model in RLHF
by: Lu, Taiming, et al.
Published: (2024)
Provably Efficient Interactive-Grounded Learning with Personalized Reward
by: Zhang, Mengxiao, et al.
Published: (2024)
ODIN: Disentangled Reward Mitigates Hacking in RLHF
by: Chen, Lichang, et al.
Published: (2024)
Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment
by: Yang, Zhiqin, et al.
Published: (2026)
A Unified Pairwise Framework for RLHF: Bridging Generative Reward Modeling and Policy Optimization
by: Xu, Wenyuan, et al.
Published: (2025)
Generalized Linear Bandits: Almost Optimal Regret with One-Pass Update
by: Zhang, Yu-Jie, et al.
Published: (2025)
Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer
by: Liu, Zhihan, et al.
Published: (2024)
Robust Length Prediction: A Perspective from Heavy-Tailed Prompt-Conditioned Distributions
by: Wang, Jing, et al.
Published: (2026)