:: Library Catalog

Bibliographic Details
Main Authors:	Wang, Zhichao; Bi, Bin; Zhu, Zixu; Mao, Xiangbo; Wang, Jun; Wang, Shiyu; Wang, Cheng; Nie, Dong; Hong, Lingzi
Format:	Preprint
Published:	2024
Subjects:	Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2410.21438

Similar Items

GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA
by: Wang, Zhichao
Published: (2025)

Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections
by: Wang, Bo, et al.
Published: (2025)

UNA: A Unified Supervised Framework for Efficient LLM Alignment Across Feedback Types
by: Wang, Zhichao, et al.
Published: (2024)

UFT: Unifying Supervised and Reinforcement Fine-Tuning
by: Liu, Mingyang, et al.
Published: (2025)

Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer
by: Liu, Zhihan, et al.
Published: (2024)

RLSR: Reinforcement Learning with Supervised Reward Outperforms SFT in Instruction Following
by: Wang, Zhichao, et al.
Published: (2025)

Continual SFT Matches Multimodal RLHF with Negative Supervision
by: Zhu, Ke, et al.
Published: (2024)

Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment
by: Yang, Zhiqin, et al.
Published: (2026)

Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning
by: Wang, Yibin, et al.
Published: (2025)

RLHF in an SFT Way: From Optimal Solution to Reward-Weighted Alignment
by: Du, Yuhao, et al.
Published: (2025)

Bootstrapping Language Models with DPO Implicit Rewards
by: Chen, Changyu, et al.
Published: (2024)

DPO Meets PPO: Reinforced Token Optimization for RLHF
by: Zhong, Han, et al.
Published: (2024)

RLHF Workflow: From Reward Modeling to Online RLHF
by: Dong, Hanze, et al.
Published: (2024)

PAFT: A Parallel Training Paradigm for Effective LLM Fine-Tuning
by: Pentyala, Shiva Kumar, et al.
Published: (2024)

Bridging SFT and DPO for Diffusion Model Alignment with Self-Sampling Preference Optimization
by: Zhang, Daoan, et al.
Published: (2024)

Reward-Robust RLHF in LLMs
by: Yan, Yuzi, et al.
Published: (2024)

RLHF Fine-Tuning of LLMs for Alignment with Implicit User Feedback in Conversational Recommenders
by: Yang, Zhongheng, et al.
Published: (2025)

Robust Post-Training for Generative Recommenders: Why Exponential Reward-Weighted SFT Outperforms RLHF
by: Chidambaram, Keertana, et al.
Published: (2026)

Reg-DPO: SFT-Regularized Direct Preference Optimization with GT-Pair for Improving Video Generation
by: Du, Jie, et al.
Published: (2025)

Reward Shaping to Mitigate Reward Hacking in RLHF
by: Fu, Jiayi, et al.
Published: (2025)

A Unified Theoretical Analysis of Private and Robust Offline Alignment: from RLHF to DPO
by: Zhou, Xingyu, et al.
Published: (2025)

Q-SFT: Q-Learning for Language Models via Supervised Fine-Tuning
by: Hong, Joey, et al.
Published: (2024)

Reward Generalization in RLHF: A Topological Perspective
by: Qiu, Tianyi, et al.
Published: (2024)

SFT Doesn't Always Hurt General Capabilities: Revisiting Domain-Specific Fine-Tuning in LLMs
by: Lin, Jiacheng, et al.
Published: (2025)

Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap
by: Qi, Xuan, et al.
Published: (2025)

Preserving Domain Generalization in Fine-Tuning via Joint Parameter Selection
by: Pan, Bin, et al.
Published: (2025)

RL Fine-Tuning Heals OOD Forgetting in SFT
by: Jin, Hangzhan, et al.
Published: (2025)

An Empirical Study of SFT-DPO Interaction and Parameterization in Small Language Models
by: Feng, Yuming, et al.
Published: (2026)

Prototypical Reward Network for Data-Efficient RLHF
by: Zhang, Jinghan, et al.
Published: (2024)

CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks
by: Wang, Hao, et al.
Published: (2026)

Information-Theoretic Reward Decomposition for Generalizable RLHF
by: Mao, Liyuan, et al.
Published: (2025)

ImplicitRM: Unbiased Reward Modeling from Implicit Preference Data for LLM alignment
by: Wang, Hao, et al.
Published: (2026)

AlphaDPO: Adaptive Reward Margin for Direct Preference Optimization
by: Wu, Junkang, et al.
Published: (2024)

Long Exposure: Accelerating Parameter-Efficient Fine-Tuning for LLMs under Shadowy Sparsity
by: Wang, Tuowei, et al.
Published: (2025)

SED-SFT: Selectively Encouraging Diversity in Supervised Fine-Tuning
by: Chen, Yijie, et al.
Published: (2026)

Process Reinforcement through Implicit Rewards
by: Cui, Ganqu, et al.
Published: (2025)

Reward Difference Optimization For Sample Reweighting In Offline RLHF
by: Wang, Shiqi, et al.
Published: (2024)

Review of Inference-Time Scaling Strategies: Reasoning, Search and RAG
by: Wang, Zhichao, et al.
Published: (2025)

Iterative Foundation Model Fine-Tuning on Multiple Rewards
by: Ghari, Pouya M., et al.
Published: (2025)

BadReward: Clean-Label Poisoning of Reward Models in Text-to-Image RLHF
by: Duan, Kaiwen, et al.
Published: (2025)