| Main Authors: | Ye, Chenlu, Xiong, Wei, Zhang, Yuheng, Dong, Hanze, Jiang, Nan, Zhang, Tong |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2402.07314 |
Similar Items
Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint
by: Xiong, Wei, et al.
Published: (2023)
Logarithmic Regret for Online KL-Regularized Reinforcement Learning
by: Zhao, Heyang, et al.
Published: (2025)
Reinforce-Ada: An Adaptive Sampling Framework under Non-linear RL Objectives
by: Xiong, Wei, et al.
Published: (2025)
Corruption-Robust Offline Reinforcement Learning with General Function Approximation
by: Ye, Chenlu, et al.
Published: (2023)
Self-rewarding correction for mathematical reasoning
by: Xiong, Wei, et al.
Published: (2025)
Towards Robust Model-Based Reinforcement Learning Against Adversarial Corruption
by: Ye, Chenlu, et al.
Published: (2024)
Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective
by: Zhang, Yuheng, et al.
Published: (2026)
RLHF Workflow: From Reward Modeling to Online RLHF
by: Dong, Hanze, et al.
Published: (2024)
Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning
by: Zhang, Yuheng, et al.
Published: (2024)
Corruption-Robust Algorithms with Uncertainty Weighting for Nonlinear Contextual Bandits and Markov Decision Processes
by: Ye, Chenlu, et al.
Published: (2022)
Optimizing Chain-of-Thought Reasoners via Gradient Variance Minimization in Rejection Sampling and RL
by: Yao, Jiarui, et al.
Published: (2025)
Contextual Online Uncertainty-Aware Preference Learning for Human Feedback
by: Lu, Nan, et al.
Published: (2025)
A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce
by: Xiong, Wei, et al.
Published: (2025)
Improving LLM General Preference Alignment via Optimistic Online Mirror Descent
by: Zhang, Yuheng, et al.
Published: (2025)
Adaptive Preference Scaling for Reinforcement Learning with Human Feedback
by: Hong, Ilgee, et al.
Published: (2024)
Data-dependent Exploration for Online Reinforcement Learning from Human Feedback
by: Zhang, Zhen-Yu, et al.
Published: (2026)
Online Preference-based Reinforcement Learning with Self-augmented Feedback from Large Language Model
by: Tu, Songjun, et al.
Published: (2024)
Self-Hinting Language Models Enhance Reinforcement Learning
by: Liao, Baohao, et al.
Published: (2026)
DOPL: Direct Online Preference Learning for Restless Bandits with Preference Feedback
by: Xiong, Guojun, et al.
Published: (2024)
Automatic Curriculum Expert Iteration for Reliable LLM Reasoning
by: Zhao, Zirui, et al.
Published: (2024)
Multi-turn Reinforcement Learning from Preference Human Feedback
by: Shani, Lior, et al.
Published: (2024)
Robust Reinforcement Learning from Corrupted Human Feedback
by: Bukharin, Alexander, et al.
Published: (2024)
Transformers as Multi-task Learners: Decoupling Features in Hidden Markov Models
by: Hao, Yifan, et al.
Published: (2025)
Building Math Agents with Multi-Turn Iterative Preference Learning
by: Xiong, Wei, et al.
Published: (2024)
Sharp Analysis for KL-Regularized Contextual Bandits and RLHF
by: Zhao, Heyang, et al.
Published: (2024)
Catoni Contextual Bandits are Robust to Heavy-tailed Rewards
by: Ye, Chenlu, et al.
Published: (2025)
On the Curses of Future and History in Future-dependent Value Functions for Off-policy Evaluation
by: Zhang, Yuheng, et al.
Published: (2024)
Statistical Tractability of Off-policy Evaluation of History-dependent Policies in POMDPs
by: Zhang, Yuheng, et al.
Published: (2025)
Extragradient Preference Optimization (EGPO): Beyond Last-Iterate Convergence for Nash Learning from Human Feedback
by: Zhou, Runlong, et al.
Published: (2025)
Swap-guided Preference Learning for Personalized Reinforcement Learning from Human Feedback
by: Kim, Gihoon, et al.
Published: (2026)
Combinatorial Reinforcement Learning with Preference Feedback
by: Lee, Joongkyu, et al.
Published: (2025)
Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning
by: Poddar, Sriyash, et al.
Published: (2024)
An Improved Analysis of Langevin Algorithms with Prior Diffusion for Non-Log-Concave Sampling
by: Huang, Xunpeng, et al.
Published: (2024)
Efficient Reinforcement Learning from Human Feedback via Bayesian Preference Inference
by: Cercola, Matteo, et al.
Published: (2025)
Beyond Pessimism: Offline Learning in KL-regularized Games
by: Zhang, Yuheng, et al.
Published: (2026)
Reinforcement Learning from Human Feedback without Reward Inference: Model-Free Algorithm and Instance-Dependent Analysis
by: Zhang, Qining, et al.
Published: (2024)
Faster Sampling via Stochastic Gradient Proximal Sampler
by: Huang, Xunpeng, et al.
Published: (2024)
Dual Active Learning for Reinforcement Learning from Human Feedback
by: Liu, Pangpang, et al.
Published: (2024)
Iterative Refinement of Flow Policies in Probability Space for Online Reinforcement Learning
by: Sun, Mingyang, et al.
Published: (2025)
LAPP: Large Language Model Feedback for Preference-Driven Reinforcement Learning
by: Jian, Pingcheng, et al.
Published: (2025)