Library Catalog

Bibliographic Details
Main Author:	Maity, Dipan
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2602.04651

Similar Items

AuON: A Linear-time Alternative to Orthogonal Momentum Updates
by: Maity, Dipan
Published: (2025)

RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
by: Lee, Harrison, et al.
Published: (2023)

Gated-SwinRMT: Unifying Swin Windowed Attention with Retentive Manhattan Decay via Input-Dependent Gating
by: Maity, Dipan, et al.
Published: (2026)

Safe RLHF-V: Safe Reinforcement Learning from Multi-modal Human Feedback
by: Ji, Jiaming, et al.
Published: (2025)

RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs
by: Chaudhari, Shreyas, et al.
Published: (2024)

ACE-RLHF: Automated Code Evaluation and Socratic Feedback Generation Tool using Large Language Models and Reinforcement Learning with Human Feedback
by: Rahman, Tasnia, et al.
Published: (2025)

Uni-RLHF: Universal Platform and Benchmark Suite for Reinforcement Learning with Diverse Human Feedback
by: Yuan, Yifu, et al.
Published: (2024)

The Alignment Ceiling: Objective Mismatch in Reinforcement Learning from Human Feedback
by: Lambert, Nathan, et al.
Published: (2023)

SAFE-RL: Saliency-Aware Counterfactual Explainer for Deep Reinforcement Learning Policies
by: Samadi, Amir, et al.
Published: (2024)

RLHF Fine-Tuning of LLMs for Alignment with Implicit User Feedback in Conversational Recommenders
by: Yang, Zhongheng, et al.
Published: (2025)

HERO: Human-Feedback Efficient Reinforcement Learning for Online Diffusion Model Finetuning
by: Hiranaka, Ayano, et al.
Published: (2024)

Mitigating the Alignment Tax of RLHF
by: Lin, Yong, et al.
Published: (2023)

PARL: A Unified Framework for Policy Alignment in Reinforcement Learning from Human Feedback
by: Chakraborty, Souradip, et al.
Published: (2023)

MaxMin-RLHF: Alignment with Diverse Human Preferences
by: Chakraborty, Souradip, et al.
Published: (2024)

Reinforcement Learning from Human Feedback
by: Lambert, Nathan
Published: (2025)

Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint
by: Xiong, Wei, et al.
Published: (2023)

Trajectory Entropy Reinforcement Learning for Predictable and Robust Control
by: You, Bang, et al.
Published: (2025)

Failure Modes of Maximum Entropy RLHF
by: Çağatan, Ömer Veysel, et al.
Published: (2025)

SALSA: Soup-based Alignment Learning for Stronger Adaptation in RLHF
by: Chegini, Atoosa, et al.
Published: (2024)

Strategyproof Reinforcement Learning from Human Feedback
by: Buening, Thomas Kleine, et al.
Published: (2025)

Explaining and Preventing Alignment Collapse in Iterative RLHF
by: Gauthier, Etienne, et al.
Published: (2026)

Evaluating Defences against Unsafe Feedback in RLHF
by: Rosati, Domenic, et al.
Published: (2024)

Understanding the Learning Dynamics of Alignment with Human Feedback
by: Im, Shawn, et al.
Published: (2024)

Robust Reinforcement Learning from Corrupted Human Feedback
by: Bukharin, Alexander, et al.
Published: (2024)

Dual Active Learning for Reinforcement Learning from Human Feedback
by: Liu, Pangpang, et al.
Published: (2024)

Flexible Blood Glucose Control: Offline Reinforcement Learning from Human Feedback
by: Emerson, Harry, et al.
Published: (2025)

Beyond RLHF: A Unified Theoretical Framework of Alignment
by: Yun, Jihun, et al.
Published: (2025)

Enhancing RLHF with Human Gaze Modeling
by: Galliamov, Karim, et al.
Published: (2025)

Solving the Inverse Alignment Problem for Efficient RLHF
by: Krishna, Shambhavi, et al.
Published: (2024)

Towards Reliable Alignment: Uncertainty-aware RLHF
by: Banerjee, Debangshu, et al.
Published: (2024)

RLHF from Heterogeneous Feedback via Personalization and Preference Aggregation
by: Park, Chanwoo, et al.
Published: (2024)

Unifying Stable Optimization and Reference Regularization in RLHF
by: He, Li, et al.
Published: (2026)

Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases
by: Hahm, Dongyoon, et al.
Published: (2026)

Reinforcement Learning from Human Feedback: A Statistical Perspective
by: Liu, Pangpang, et al.
Published: (2026)

Reinforcement Learning from Multi-level and Episodic Human Feedback
by: Elahi, Muhammad Qasim, et al.
Published: (2025)

A Minimaximalist Approach to Reinforcement Learning from Human Feedback
by: Swamy, Gokul, et al.
Published: (2024)

Dense Reward for Free in Reinforcement Learning from Human Feedback
by: Chan, Alex J., et al.
Published: (2024)

Multi-turn Reinforcement Learning from Preference Human Feedback
by: Shani, Lior, et al.
Published: (2024)

Principled Penalty-based Methods for Bilevel Reinforcement Learning and RLHF
by: Shen, Han, et al.
Published: (2024)

Position: The Complexity of Perfect AI Alignment -- Formalizing the RLHF Trilemma
by: Sahoo, Subramanyam, et al.
Published: (2025)