| Main Authors: | Wang, Zhichao, Bi, Bin, Zhu, Zixu, Mao, Xiangbo, Wang, Jun, Wang, Shiyu, Wang, Cheng, Nie, Dong, Hong, Lingzi |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2410.21438 |
Similar Items
GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA
by: Wang, Zhichao
Published: (2025)
Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections
by: Wang, Bo, et al.
Published: (2025)
UNA: A Unified Supervised Framework for Efficient LLM Alignment Across Feedback Types
by: Wang, Zhichao, et al.
Published: (2024)
UFT: Unifying Supervised and Reinforcement Fine-Tuning
by: Liu, Mingyang, et al.
Published: (2025)
Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer
by: Liu, Zhihan, et al.
Published: (2024)
RLSR: Reinforcement Learning with Supervised Reward Outperforms SFT in Instruction Following
by: Wang, Zhichao, et al.
Published: (2025)
Continual SFT Matches Multimodal RLHF with Negative Supervision
by: Zhu, Ke, et al.
Published: (2024)
Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment
by: Yang, Zhiqin, et al.
Published: (2026)
Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning
by: Wang, Yibin, et al.
Published: (2025)
RLHF in an SFT Way: From Optimal Solution to Reward-Weighted Alignment
by: Du, Yuhao, et al.
Published: (2025)
Bootstrapping Language Models with DPO Implicit Rewards
by: Chen, Changyu, et al.
Published: (2024)
DPO Meets PPO: Reinforced Token Optimization for RLHF
by: Zhong, Han, et al.
Published: (2024)
RLHF Workflow: From Reward Modeling to Online RLHF
by: Dong, Hanze, et al.
Published: (2024)
PAFT: A Parallel Training Paradigm for Effective LLM Fine-Tuning
by: Pentyala, Shiva Kumar, et al.
Published: (2024)
Bridging SFT and DPO for Diffusion Model Alignment with Self-Sampling Preference Optimization
by: Zhang, Daoan, et al.
Published: (2024)
Reward-Robust RLHF in LLMs
by: Yan, Yuzi, et al.
Published: (2024)
RLHF Fine-Tuning of LLMs for Alignment with Implicit User Feedback in Conversational Recommenders
by: Yang, Zhongheng, et al.
Published: (2025)
Robust Post-Training for Generative Recommenders: Why Exponential Reward-Weighted SFT Outperforms RLHF
by: Chidambaram, Keertana, et al.
Published: (2026)
Reg-DPO: SFT-Regularized Direct Preference Optimization with GT-Pair for Improving Video Generation
by: Du, Jie, et al.
Published: (2025)
Reward Shaping to Mitigate Reward Hacking in RLHF
by: Fu, Jiayi, et al.
Published: (2025)
A Unified Theoretical Analysis of Private and Robust Offline Alignment: from RLHF to DPO
by: Zhou, Xingyu, et al.
Published: (2025)
Q-SFT: Q-Learning for Language Models via Supervised Fine-Tuning
by: Hong, Joey, et al.
Published: (2024)
Reward Generalization in RLHF: A Topological Perspective
by: Qiu, Tianyi, et al.
Published: (2024)
SFT Doesn't Always Hurt General Capabilities: Revisiting Domain-Specific Fine-Tuning in LLMs
by: Lin, Jiacheng, et al.
Published: (2025)
Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap
by: Qi, Xuan, et al.
Published: (2025)
Preserving Domain Generalization in Fine-Tuning via Joint Parameter Selection
by: Pan, Bin, et al.
Published: (2025)
RL Fine-Tuning Heals OOD Forgetting in SFT
by: Jin, Hangzhan, et al.
Published: (2025)
An Empirical Study of SFT-DPO Interaction and Parameterization in Small Language Models
by: Feng, Yuming, et al.
Published: (2026)
Prototypical Reward Network for Data-Efficient RLHF
by: Zhang, Jinghan, et al.
Published: (2024)
CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks
by: Wang, Hao, et al.
Published: (2026)
Information-Theoretic Reward Decomposition for Generalizable RLHF
by: Mao, Liyuan, et al.
Published: (2025)
ImplicitRM: Unbiased Reward Modeling from Implicit Preference Data for LLM alignment
by: Wang, Hao, et al.
Published: (2026)
AlphaDPO: Adaptive Reward Margin for Direct Preference Optimization
by: Wu, Junkang, et al.
Published: (2024)
Long Exposure: Accelerating Parameter-Efficient Fine-Tuning for LLMs under Shadowy Sparsity
by: Wang, Tuowei, et al.
Published: (2025)
SED-SFT: Selectively Encouraging Diversity in Supervised Fine-Tuning
by: Chen, Yijie, et al.
Published: (2026)
Process Reinforcement through Implicit Rewards
by: Cui, Ganqu, et al.
Published: (2025)
Reward Difference Optimization For Sample Reweighting In Offline RLHF
by: Wang, Shiqi, et al.
Published: (2024)
Review of Inference-Time Scaling Strategies: Reasoning, Search and RAG
by: Wang, Zhichao, et al.
Published: (2025)
Iterative Foundation Model Fine-Tuning on Multiple Rewards
by: Ghari, Pouya M., et al.
Published: (2025)
BadReward: Clean-Label Poisoning of Reward Models in Text-to-Image RLHF
by: Duan, Kaiwen, et al.
Published: (2025)