| Main Authors: | Zhang, Xiaoying, Ton, Jean-Francois, Shen, Wei, Wang, Hongning, Liu, Yang |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2403.05171 |
Similar Items
Reviving The Classics: Active Reward Modeling in Large Language Model Alignment
by: Shen, Yunyi, et al.
Published: (2025)
Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer
by: Liu, Zhihan, et al.
Published: (2024)
Rethinking Reward Model Evaluation Through the Lens of Reward Overoptimization
by: Kim, Sunghwan, et al.
Published: (2025)
Iterative Data Smoothing: Mitigating Reward Overfitting and Overoptimization in RLHF
by: Zhu, Banghua, et al.
Published: (2024)
Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms
by: Rafailov, Rafael, et al.
Published: (2024)
Reusing Embeddings: Reproducible Reward Model Research in Large Language Model Alignment without GPUs
by: Sun, Hao, et al.
Published: (2025)
Trust-Region Adaptive Policy Optimization
by: Su, Mingyu, et al.
Published: (2025)
ORSO: Accelerating Reward Design via Online Reward Selection and Policy Optimization
by: Zhang, Chen Bo Calvin, et al.
Published: (2024)
Understanding Chain-of-Thought in LLMs through Information Theory
by: Ton, Jean-Francois, et al.
Published: (2024)
Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment
by: Liu, Yang, et al.
Published: (2023)
Correcting the Mythos of KL-Regularization: Direct Alignment without Overoptimization via Chi-Squared Preference Optimization
by: Huang, Audrey, et al.
Published: (2024)
DRPO: Efficient Reasoning via Decoupled Reward Policy Optimization
by: Li, Gang, et al.
Published: (2025)
Pretrain Value, Not Reward: Decoupled Value Policy Optimization
by: Huang, Chenghua, et al.
Published: (2025)
Towards Robust Policy: Enhancing Offline Reinforcement Learning with Adversarial Attacks and Defenses
by: Nguyen, Thanh, et al.
Published: (2024)
Value-Free Policy Optimization via Reward Partitioning
by: Faye, Bilal, et al.
Published: (2025)
Measuring and Reducing LLM Hallucination without Gold-Standard Answers
by: Wei, Jiaheng, et al.
Published: (2024)
Policy Filtration for RLHF to Mitigate Noise in Reward Models
by: Zhang, Chuheng, et al.
Published: (2024)
Intrinsic Reward Policy Optimization for Sparse-Reward Environments
by: Cho, Minjae, et al.
Published: (2026)
Adversarial Training for Process Reward Models
by: Juneja, Gurusha, et al.
Published: (2025)
Provably Mitigating Corruption, Overoptimization, and Verbosity Simultaneously in Offline and Online RLHF/DPO Alignment
by: Chen, Ziyi, et al.
Published: (2025)
Trust Region Reward Optimization and Proximal Inverse Reward Optimization Algorithm
by: Chen, Yang, et al.
Published: (2025)
RiskPO: Risk-based Policy Optimization via Verifiable Reward for LLM Post-Training
by: Ren, Tao, et al.
Published: (2025)
Adversarial Reward Auditing for Active Detection and Mitigation of Reward Hacking
by: Beigi, Mohammad, et al.
Published: (2026)
Selective Preference Optimization via Token-Level Reward Function Estimation
by: Yang, Kailai, et al.
Published: (2024)
Invariant Learning via Probability of Sufficient and Necessary Causes
by: Yang, Mengyue, et al.
Published: (2023)
Overcoming Overfitting in Reinforcement Learning via Gaussian Process Diffusion Policy
by: Horprasert, Amornyos, et al.
Published: (2025)
UCPO: Uncertainty-Aware Policy Optimization
by: Zeng, Xianzhou, et al.
Published: (2026)
CROP: Conservative Reward for Model-based Offline Policy Optimization
by: Li, Hao, et al.
Published: (2023)
ODRPO: Ordinal Decompositions of Discrete Rewards for Robust Policy Optimization
by: Patel, Nirmal, et al.
Published: (2026)
GOPO: Policy Optimization using Ranked Rewards
by: Choi, Kyuseong, et al.
Published: (2026)
Mutual-Taught for Co-adapting Policy and Reward Models
by: Shi, Tianyuan, et al.
Published: (2025)
How to Train a Leader: Hierarchical Reasoning in Multi-Agent LLMs
by: Estornell, Andrew, et al.
Published: (2025)
Guiding not Forcing: Enhancing the Transferability of Jailbreaking Attacks on LLMs via Removing Superfluous Constraints
by: Yang, Junxiao, et al.
Published: (2025)
Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings
by: Wu, Yuning, et al.
Published: (2026)
Enhancing Adversarial Training via Reweighting Optimization Trajectory
by: Huang, Tianjin, et al.
Published: (2023)
Federated Linear Contextual Bandits with Heterogeneous Clients
by: Blaser, Ethan, et al.
Published: (2024)
Detector-Evasive LLM Paraphrasing via Constrained Policy Optimization
by: Wang, Mingyi, et al.
Published: (2026)
PerPO: Perceptual Preference Optimization via Discriminative Rewarding
by: Zhu, Zining, et al.
Published: (2025)
Disentangling Policy from Offline Task Representation Learning via Adversarial Data Augmentation
by: Jia, Chengxing, et al.
Published: (2024)
ReDit: Reward Dithering for Improved LLM Policy Optimization
by: Wei, Chenxing, et al.
Published: (2025)