| Main Author: | Maity, Dipan |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2602.04651 |
Similar Items
AuON: A Linear-time Alternative to Orthogonal Momentum Updates
by: Maity, Dipan
Published: (2025)
RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
by: Lee, Harrison, et al.
Published: (2023)
Gated-SwinRMT: Unifying Swin Windowed Attention with Retentive Manhattan Decay via Input-Dependent Gating
by: Maity, Dipan, et al.
Published: (2026)
Safe RLHF-V: Safe Reinforcement Learning from Multi-modal Human Feedback
by: Ji, Jiaming, et al.
Published: (2025)
RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs
by: Chaudhari, Shreyas, et al.
Published: (2024)
ACE-RLHF: Automated Code Evaluation and Socratic Feedback Generation Tool using Large Language Models and Reinforcement Learning with Human Feedback
by: Rahman, Tasnia, et al.
Published: (2025)
Uni-RLHF: Universal Platform and Benchmark Suite for Reinforcement Learning with Diverse Human Feedback
by: Yuan, Yifu, et al.
Published: (2024)
The Alignment Ceiling: Objective Mismatch in Reinforcement Learning from Human Feedback
by: Lambert, Nathan, et al.
Published: (2023)
SAFE-RL: Saliency-Aware Counterfactual Explainer for Deep Reinforcement Learning Policies
by: Samadi, Amir, et al.
Published: (2024)
RLHF Fine-Tuning of LLMs for Alignment with Implicit User Feedback in Conversational Recommenders
by: Yang, Zhongheng, et al.
Published: (2025)
HERO: Human-Feedback Efficient Reinforcement Learning for Online Diffusion Model Finetuning
by: Hiranaka, Ayano, et al.
Published: (2024)
Mitigating the Alignment Tax of RLHF
by: Lin, Yong, et al.
Published: (2023)
PARL: A Unified Framework for Policy Alignment in Reinforcement Learning from Human Feedback
by: Chakraborty, Souradip, et al.
Published: (2023)
MaxMin-RLHF: Alignment with Diverse Human Preferences
by: Chakraborty, Souradip, et al.
Published: (2024)
Reinforcement Learning from Human Feedback
by: Lambert, Nathan
Published: (2025)
Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint
by: Xiong, Wei, et al.
Published: (2023)
Trajectory Entropy Reinforcement Learning for Predictable and Robust Control
by: You, Bang, et al.
Published: (2025)
Failure Modes of Maximum Entropy RLHF
by: Çağatan, Ömer Veysel, et al.
Published: (2025)
SALSA: Soup-based Alignment Learning for Stronger Adaptation in RLHF
by: Chegini, Atoosa, et al.
Published: (2024)
Strategyproof Reinforcement Learning from Human Feedback
by: Buening, Thomas Kleine, et al.
Published: (2025)
Explaining and Preventing Alignment Collapse in Iterative RLHF
by: Gauthier, Etienne, et al.
Published: (2026)
Evaluating Defences against Unsafe Feedback in RLHF
by: Rosati, Domenic, et al.
Published: (2024)
Understanding the Learning Dynamics of Alignment with Human Feedback
by: Im, Shawn, et al.
Published: (2024)
Robust Reinforcement Learning from Corrupted Human Feedback
by: Bukharin, Alexander, et al.
Published: (2024)
Dual Active Learning for Reinforcement Learning from Human Feedback
by: Liu, Pangpang, et al.
Published: (2024)
Flexible Blood Glucose Control: Offline Reinforcement Learning from Human Feedback
by: Emerson, Harry, et al.
Published: (2025)
Beyond RLHF: A Unified Theoretical Framework of Alignment
by: Yun, Jihun, et al.
Published: (2025)
Enhancing RLHF with Human Gaze Modeling
by: Galliamov, Karim, et al.
Published: (2025)
Solving the Inverse Alignment Problem for Efficient RLHF
by: Krishna, Shambhavi, et al.
Published: (2024)
Towards Reliable Alignment: Uncertainty-aware RLHF
by: Banerjee, Debangshu, et al.
Published: (2024)
RLHF from Heterogeneous Feedback via Personalization and Preference Aggregation
by: Park, Chanwoo, et al.
Published: (2024)
Unifying Stable Optimization and Reference Regularization in RLHF
by: He, Li, et al.
Published: (2026)
Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases
by: Hahm, Dongyoon, et al.
Published: (2026)
Reinforcement Learning from Human Feedback: A Statistical Perspective
by: Liu, Pangpang, et al.
Published: (2026)
Reinforcement Learning from Multi-level and Episodic Human Feedback
by: Elahi, Muhammad Qasim, et al.
Published: (2025)
A Minimaximalist Approach to Reinforcement Learning from Human Feedback
by: Swamy, Gokul, et al.
Published: (2024)
Dense Reward for Free in Reinforcement Learning from Human Feedback
by: Chan, Alex J., et al.
Published: (2024)
Multi-turn Reinforcement Learning from Preference Human Feedback
by: Shani, Lior, et al.
Published: (2024)
Principled Penalty-based Methods for Bilevel Reinforcement Learning and RLHF
by: Shen, Han, et al.
Published: (2024)
Position: The Complexity of Perfect AI Alignment -- Formalizing the RLHF Trilemma
by: Sahoo, Subramanyam, et al.
Published: (2025)