| Main Authors: | Zheng, Chen, Sun, Ke, Wu, Hang, Xi, Chenguang, Zhou, Xun |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2403.02513 |
Similar Items
Mistral-C2F: Coarse to Fine Actor for Analytical and Reasoning Enhancement in RLHF and Effective-Merged LLMs
by: Zheng, Chen, et al.
Published: (2024)
ICE-GRT: Instruction Context Enhancement by Generative Reinforcement based Transformers
by: Zheng, Chen, et al.
Published: (2024)
Balanced Actor Initialization: Stable RLHF Training of Distillation-Based Reasoning Models
by: Zheng, Chen, et al.
Published: (2025)
WPO: Enhancing RLHF with Weighted Preference Optimization
by: Zhou, Wenxuan, et al.
Published: (2024)
Revisiting Backdoor Attacks on LLMs: A Stealthy and Practical Poisoning Framework via Harmless Inputs
by: Kong, Jiawei, et al.
Published: (2025)
Dishonesty in Helpful and Harmless Alignment
by: Huang, Youcheng, et al.
Published: (2024)
Enhancing the General Agent Capabilities of Low-Parameter LLMs through Tuning and Multi-Branch Reasoning
by: Zhou, Qinhao, et al.
Published: (2024)
FFT: Towards Harmlessness Evaluation and Analysis for LLMs with Factuality, Fairness, Toxicity
by: Cui, Shiyao, et al.
Published: (2023)
GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond
by: Zheng, Shen, et al.
Published: (2023)
Mix Data or Merge Models? Balancing the Helpfulness, Honesty, and Harmlessness of Large Language Model via Model Merging
by: Yang, Jinluan, et al.
Published: (2025)
Harnessing RLHF for Robust Unanswerability Recognition and Trustworthy Response Generation in LLMs
by: Lin, Shuyuan, et al.
Published: (2025)
Taming Overconfidence in LLMs: Reward Calibration in RLHF
by: Leng, Jixuan, et al.
Published: (2024)
Reward-Robust RLHF in LLMs
by: Yan, Yuzi, et al.
Published: (2024)
InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance
by: Wang, Pengyu, et al.
Published: (2024)
Towards Federated RLHF with Aggregated Client Preference for LLMs
by: Wu, Feijie, et al.
Published: (2024)
H3Fusion: Helpful, Harmless, Honest Fusion of Aligned LLMs
by: Tekin, Selim Furkan, et al.
Published: (2024)
A Desideratum for Conversational Agents: Capabilities, Challenges, and Future Directions
by: Acikgoz, Emre Can, et al.
Published: (2025)
We Think, Therefore We Align LLMs to Helpful, Harmless and Honest Before They Go Wrong
by: Kashyap, Gautam Siddharth, et al.
Published: (2025)
PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference
by: Ji, Jiaming, et al.
Published: (2024)
BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping
by: Xi, Zhiheng, et al.
Published: (2025)
Continual SFT Matches Multimodal RLHF with Negative Supervision
by: Zhu, Ke, et al.
Published: (2024)
AAPO: Enhancing the Reasoning Capabilities of LLMs with Advantage Margin
by: Xiong, Jian, et al.
Published: (2025)
Beyond the Safety Bundle: Auditing the Helpful and Harmless Dataset
by: Chehbouni, Khaoula, et al.
Published: (2024)
MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions
by: Chai, Yekun, et al.
Published: (2024)
DuplexMamba: Enhancing Real-time Speech Conversations with Duplex and Streaming Capabilities
by: Lu, Xiangyu, et al.
Published: (2025)
More Than Catastrophic Forgetting: Integrating General Capabilities For Domain-Specific LLMs
by: Liu, Chengyuan, et al.
Published: (2024)
Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities
by: Sun, Chung-En, et al.
Published: (2024)
MoL for LLMs: Dual-Loss Optimization to Enhance Domain Expertise While Preserving General Capabilities
by: Chen, Jingxue, et al.
Published: (2025)
RLHF Workflow: From Reward Modeling to Online RLHF
by: Dong, Hanze, et al.
Published: (2024)
OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework
by: Hu, Jian, et al.
Published: (2024)
Reinforcement Learning Amplifies Emergent Misalignment from Harmless Rewards
by: Jørgenvåg, Magnus, et al.
Published: (2026)
Too Helpful, Too Harmless, Too Honest or Just Right?
by: Kashyap, Gautam Siddharth, et al.
Published: (2025)
PROST-LLM: Progressively Enhancing the Speech-to-Speech Translation Capability in LLMs
by: Xu, Jing, et al.
Published: (2026)
Can LLMs "Reason" in Music? An Evaluation of LLMs' Capability of Music Understanding and Generation
by: Zhou, Ziya, et al.
Published: (2024)
Towards Harmless Multimodal Assistants with Blind Preference Optimization
by: Li, Yongqi, et al.
Published: (2025)
Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model
by: Yin, Yueqin, et al.
Published: (2025)
Balancing Synthetic Data and Replay for Enhancing Task-Specific Capabilities
by: Spiegelhalter, Urs, et al.
Published: (2025)
The Flip Side of RLHF: On-Policy Feedback for Reward Model Self-Supervised Improvement
by: Wang, Xiaobo, et al.
Published: (2026)
From Perceptions to Decisions: Wildfire Evacuation Decision Prediction with Behavioral Theory-informed LLMs
by: Chen, Ruxiao, et al.
Published: (2025)
Enhancing Efficiency and Exploration in Reinforcement Learning for LLMs
by: Liao, Mengqi, et al.
Published: (2025)