:: Library Catalog

Bibliographic Details
Main Author:	Nguyen, Khanh
Format:	Preprint
Published:	2023
Subjects:	Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2305.17760

Similar Items

Understanding Understanding: A Pragmatic Framework Motivated by Large Language Models
by: Leyton-Brown, Kevin, et al.
Published: (2024)

RLHF Workflow: From Reward Modeling to Online RLHF
by: Dong, Hanze, et al.
Published: (2024)

Language-Guided World Models: A Model-Based Approach to AI Control
by: Zhang, Alex, et al.
Published: (2024)

Automatic Prompt Selection for Large Language Models
by: Do, Viet-Tung, et al.
Published: (2024)

Understanding the Effects of RLHF on LLM Generalisation and Diversity
by: Kirk, Robert, et al.
Published: (2023)

Optimizing RLHF Training for Large Language Models with Stage Fusion
by: Zhong, Yinmin, et al.
Published: (2024)

Reward Model Overoptimisation in Iterated RLHF
by: Wolf, Lorenz, et al.
Published: (2025)

How to Evaluate Reward Models for RLHF
by: Frick, Evan, et al.
Published: (2024)

Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO
by: Shi, Ruizhe, et al.
Published: (2025)

Proxy-RLHF: Decoupling Generation and Alignment in Large Language Model with Proxy
by: Zhu, Yu, et al.
Published: (2024)

Quantile Regression for Distributional Reward Models in RLHF
by: Dorka, Nicolai
Published: (2024)

Derailing Non-Answers via Logit Suppression at Output Subspace Boundaries in RLHF-Aligned Language Models
by: Dam, Harvey, et al.
Published: (2025)

Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models
by: Noukhovitch, Michael, et al.
Published: (2024)

Does RLHF Scale? Exploring the Impacts From Data, Model, and Method
by: Hou, Zhenyu, et al.
Published: (2024)

Understanding Emergent Abilities of Language Models from the Loss Perspective
by: Du, Zhengxiao, et al.
Published: (2024)

Reward Generalization in RLHF: A Topological Perspective
by: Qiu, Tianyi, et al.
Published: (2024)

DocMIA: Document-Level Membership Inference Attacks against DocVQA Models
by: Nguyen, Khanh, et al.
Published: (2025)

Advancing Translation Preference Modeling with RLHF: A Step Towards Cost-Effective Solution
by: Xu, Nuo, et al.
Published: (2024)

It Takes Two: On the Seamlessness between Reward and Policy Model in RLHF
by: Lu, Taiming, et al.
Published: (2024)

Failure Modes of Maximum Entropy RLHF
by: Çağatan, Ömer Veysel, et al.
Published: (2025)

OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework
by: Hu, Jian, et al.
Published: (2024)

ReaL: Efficient RLHF Training of Large Language Models with Parameter Reallocation
by: Mei, Zhiyu, et al.
Published: (2024)

Evaluating Defences against Unsafe Feedback in RLHF
by: Rosati, Domenic, et al.
Published: (2024)

Solving the Inverse Alignment Problem for Efficient RLHF
by: Krishna, Shambhavi, et al.
Published: (2024)

CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks
by: Wang, Hao, et al.
Published: (2026)

Understanding Reasoning Ability of Language Models From the Perspective of Reasoning Paths Aggregation
by: Wang, Xinyi, et al.
Published: (2024)

Bayesian Mixture of Experts For Large Language Models
by: Dialameh, Maryam, et al.
Published: (2025)

RLHF and IIA: Perverse Incentives
by: Xu, Wanqiao, et al.
Published: (2023)

Reward-Robust RLHF in LLMs
by: Yan, Yuzi, et al.
Published: (2024)

Federated Document Visual Question Answering: A Pilot Study
by: Nguyen, Khanh, et al.
Published: (2024)

Leveraging Domain Knowledge for Efficient Reward Modelling in RLHF: A Case-Study in E-Commerce Opinion Summarization
by: Nath, Swaroop, et al.
Published: (2024)

Group Robust Preference Optimization in Reward-free RLHF
by: Ramesh, Shyam Sundhar, et al.
Published: (2024)

Why Is RLHF Alignment Shallow? A Gradient Analysis
by: Young, Robin
Published: (2026)

Deep Bayesian Active Learning for Preference Modeling in Large Language Models
by: Melo, Luckeciano C., et al.
Published: (2024)

Can Authorship Attribution Models Distinguish Speakers in Speech Transcripts?
by: Aggazzotti, Cristina, et al.
Published: (2023)

How do Large Language Models Understand Relevance? A Mechanistic Interpretability Perspective
by: Liu, Qi, et al.
Published: (2025)

Distributional Surgery for Language Model Activations
by: Nguyen, Bao, et al.
Published: (2025)

Understanding and Steering the Cognitive Behaviors of Reasoning Models at Test-Time
by: Zhang, Zhenyu, et al.
Published: (2025)

Understanding the Language Model to Solve the Symbolic Multi-Step Reasoning Problem from the Perspective of Buffer Mechanism
by: Wang, Zhiwei, et al.
Published: (2024)

A Long Way to Go: Investigating Length Correlations in RLHF
by: Singhal, Prasann, et al.
Published: (2023)