| Main Authors: | Hong, Jiwoo, Tang, Shao, Wang, Zhipeng |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2602.08819 |
Similar Items
On the Robustness of Reward Models for Language Model Alignment
by: Hong, Jiwoo, et al.
Published: (2025)
Preference Poisoning Attacks on Reward Model Learning
by: Wu, Junlin, et al.
Published: (2024)
Generalizing Reward Modeling for Out-of-Distribution Preference Learning
by: Jia, Chen
Published: (2024)
Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences
by: Karlekar, Sweta, et al.
Published: (2026)
Pragmatic Feature Preferences: Learning Reward-Relevant Preferences from Human Input
by: Peng, Andi, et al.
Published: (2024)
Robust Preference Optimization through Reward Model Distillation
by: Fisch, Adam, et al.
Published: (2024)
Deep Bayesian Active Learning for Preference Modeling in Large Language Models
by: Melo, Luckeciano C., et al.
Published: (2024)
Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts
by: Wang, Haoxiang, et al.
Published: (2024)
Adaptive Test-Time Reasoning via Reward-Guided Dual-Phase Search
by: Cui, Yingqian, et al.
Published: (2025)
On the Limited Generalization Capability of the Implicit Reward Model Induced by Direct Preference Optimization
by: Lin, Yong, et al.
Published: (2024)
Reward-aware Preference Optimization: A Unified Mathematical Framework for Model Alignment
by: Sun, Shengyang, et al.
Published: (2025)
RFG: Test-Time Scaling for Diffusion Large Language Model Reasoning with Reward-Free Guidance
by: Chen, Tianlang, et al.
Published: (2025)
Beyond Scalar Reward Model: Learning Generative Judge from Preference Data
by: Ye, Ziyi, et al.
Published: (2024)
C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences
by: Kawabata, Akira, et al.
Published: (2026)
Exploring Chain-of-Thought Reasoning for Steerable Pluralistic Alignment
by: Zhang, Yunfan, et al.
Published: (2025)
Simultaneous Reward Distillation and Preference Learning: Get You a Language Model Who Can Do Both
by: Nath, Abhijnan, et al.
Published: (2024)
Group Robust Preference Optimization in Reward-free RLHF
by: Ramesh, Shyam Sundhar, et al.
Published: (2024)
Entropy Centroids as Intrinsic Rewards for Test-Time Scaling
by: Zhao, Wenshuo, et al.
Published: (2026)
Tool Calling is Linearly Readable and Steerable in Language Models
by: Wu, Zekun, et al.
Published: (2026)
Larger or Smaller Reward Margins to Select Preferences for Alignment?
by: Huang, Kexin, et al.
Published: (2025)
West-of-N: Synthetic Preferences for Self-Improving Reward Models
by: Pace, Alizée, et al.
Published: (2024)
EVALUESTEER: Measuring Reward Model Steerability Towards Values and Preferences
by: Ghate, Kshitish, et al.
Published: (2025)
AutoRule: Reasoning Chain-of-thought Extracted Rule-based Rewards Improve Preference Learning
by: Wang, Tevin, et al.
Published: (2025)
When In-Distribution Gains Fail: Evaluating Weak-to-Strong Reward Models under Preference Shift
by: Le, Khoi, et al.
Published: (2026)
Sentence-level Reward Model can Generalize Better for Aligning LLM from Human Preference
by: Qiu, Wenjie, et al.
Published: (2025)
Reinforcement Learning Teachers of Test Time Scaling
by: Cetin, Edoardo, et al.
Published: (2025)
AlphaDPO: Adaptive Reward Margin for Direct Preference Optimization
by: Wu, Junkang, et al.
Published: (2024)
The Geometric Canary: Predicting Steerability and Detecting Drift via Representational Stability
by: Raju, Prashant C.
Published: (2026)
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
by: Rafailov, Rafael, et al.
Published: (2023)
DreamReward: Text-to-3D Generation with Human Preference
by: Ye, Junliang, et al.
Published: (2024)
SimPO: Simple Preference Optimization with a Reference-Free Reward
by: Meng, Yu, et al.
Published: (2024)
On the Power of (Approximate) Reward Models for Inference-Time Scaling
by: Zhu, Youheng, et al.
Published: (2026)
Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards
by: Wang, Haoxiang, et al.
Published: (2024)
VerifierQ: Enhancing LLM Test Time Compute with Q-Learning-based Verifiers
by: Qi, Jianing, et al.
Published: (2024)
Energy-Based Preference Model Offers Better Offline Alignment than the Bradley-Terry Preference Model
by: Hong, Yuzhong, et al.
Published: (2024)
Integrating Time Series into LLMs via Multi-layer Steerable Embedding Fusion for Enhanced Forecasting
by: Chen, Zhuomin, et al.
Published: (2025)
Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training
by: Xu, Ran, et al.
Published: (2026)
Test-Time Scaling with Reflective Generative Model
by: Wang, Zixiao, et al.
Published: (2025)
Inference-Time Scaling for Generalist Reward Modeling
by: Liu, Zijun, et al.
Published: (2025)
A Course Correction in Steerability Evaluation: Revealing Miscalibration and Side Effects in LLMs
by: Chang, Trenton, et al.
Published: (2025)