:: Library Catalog

Bibliographic Details
Main Authors:	Zheng, Chen; Sun, Ke; Wu, Hang; Xi, Chenguang; Zhou, Xun
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2403.02513

Similar Items

Mistral-C2F: Coarse to Fine Actor for Analytical and Reasoning Enhancement in RLHF and Effective-Merged LLMs
by: Zheng, Chen, et al.
Published: (2024)

ICE-GRT: Instruction Context Enhancement by Generative Reinforcement based Transformers
by: Zheng, Chen, et al.
Published: (2024)

Balanced Actor Initialization: Stable RLHF Training of Distillation-Based Reasoning Models
by: Zheng, Chen, et al.
Published: (2025)

WPO: Enhancing RLHF with Weighted Preference Optimization
by: Zhou, Wenxuan, et al.
Published: (2024)

Revisiting Backdoor Attacks on LLMs: A Stealthy and Practical Poisoning Framework via Harmless Inputs
by: Kong, Jiawei, et al.
Published: (2025)

Dishonesty in Helpful and Harmless Alignment
by: Huang, Youcheng, et al.
Published: (2024)

Enhancing the General Agent Capabilities of Low-Parameter LLMs through Tuning and Multi-Branch Reasoning
by: Zhou, Qinhao, et al.
Published: (2024)

FFT: Towards Harmlessness Evaluation and Analysis for LLMs with Factuality, Fairness, Toxicity
by: Cui, Shiyao, et al.
Published: (2023)

GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond
by: Zheng, Shen, et al.
Published: (2023)

Mix Data or Merge Models? Balancing the Helpfulness, Honesty, and Harmlessness of Large Language Model via Model Merging
by: Yang, Jinluan, et al.
Published: (2025)

Harnessing RLHF for Robust Unanswerability Recognition and Trustworthy Response Generation in LLMs
by: Lin, Shuyuan, et al.
Published: (2025)

Taming Overconfidence in LLMs: Reward Calibration in RLHF
by: Leng, Jixuan, et al.
Published: (2024)

Reward-Robust RLHF in LLMs
by: Yan, Yuzi, et al.
Published: (2024)

InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance
by: Wang, Pengyu, et al.
Published: (2024)

Towards Federated RLHF with Aggregated Client Preference for LLMs
by: Wu, Feijie, et al.
Published: (2024)

H3Fusion: Helpful, Harmless, Honest Fusion of Aligned LLMs
by: Tekin, Selim Furkan, et al.
Published: (2024)

A Desideratum for Conversational Agents: Capabilities, Challenges, and Future Directions
by: Acikgoz, Emre Can, et al.
Published: (2025)

We Think, Therefore We Align LLMs to Helpful, Harmless and Honest Before They Go Wrong
by: Kashyap, Gautam Siddharth, et al.
Published: (2025)

PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference
by: Ji, Jiaming, et al.
Published: (2024)

BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping
by: Xi, Zhiheng, et al.
Published: (2025)

Continual SFT Matches Multimodal RLHF with Negative Supervision
by: Zhu, Ke, et al.
Published: (2024)

AAPO: Enhancing the Reasoning Capabilities of LLMs with Advantage Margin
by: Xiong, Jian, et al.
Published: (2025)

Beyond the Safety Bundle: Auditing the Helpful and Harmless Dataset
by: Chehbouni, Khaoula, et al.
Published: (2024)

MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions
by: Chai, Yekun, et al.
Published: (2024)

DuplexMamba: Enhancing Real-time Speech Conversations with Duplex and Streaming Capabilities
by: Lu, Xiangyu, et al.
Published: (2025)

More Than Catastrophic Forgetting: Integrating General Capabilities For Domain-Specific LLMs
by: Liu, Chengyuan, et al.
Published: (2024)

Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities
by: Sun, Chung-En, et al.
Published: (2024)

MoL for LLMs: Dual-Loss Optimization to Enhance Domain Expertise While Preserving General Capabilities
by: Chen, Jingxue, et al.
Published: (2025)

RLHF Workflow: From Reward Modeling to Online RLHF
by: Dong, Hanze, et al.
Published: (2024)

OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework
by: Hu, Jian, et al.
Published: (2024)

Reinforcement Learning Amplifies Emergent Misalignment from Harmless Rewards
by: Jørgenvåg, Magnus, et al.
Published: (2026)

Too Helpful, Too Harmless, Too Honest or Just Right?
by: Kashyap, Gautam Siddharth, et al.
Published: (2025)

PROST-LLM: Progressively Enhancing the Speech-to-Speech Translation Capability in LLMs
by: Xu, Jing, et al.
Published: (2026)

Can LLMs "Reason" in Music? An Evaluation of LLMs' Capability of Music Understanding and Generation
by: Zhou, Ziya, et al.
Published: (2024)

Towards Harmless Multimodal Assistants with Blind Preference Optimization
by: Li, Yongqi, et al.
Published: (2025)

Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model
by: Yin, Yueqin, et al.
Published: (2025)

Balancing Synthetic Data and Replay for Enhancing Task-Specific Capabilities
by: Spiegelhalter, Urs, et al.
Published: (2025)

The Flip Side of RLHF: On-Policy Feedback for Reward Model Self-Supervised Improvement
by: Wang, Xiaobo, et al.
Published: (2026)

From Perceptions to Decisions: Wildfire Evacuation Decision Prediction with Behavioral Theory-informed LLMs
by: Chen, Ruxiao, et al.
Published: (2025)

Enhancing Efficiency and Exploration in Reinforcement Learning for LLMs
by: Liao, Mengqi, et al.
Published: (2025)