Staff View: GRAIT: Gradient-Driven Refusal-Aware Instruction Tuning for Effective Hallucination Mitigation :: Library Catalog

Bibliographic Details
Main Authors:	Zhu, Runchuan; Jiang, Zinco; Wu, Jiang; Ma, Zhipeng; Song, Jiahe; Bai, Fengshuo; Lin, Dahua; Wu, Lijun; He, Conghui
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2502.05911

_version_	1866909485451182080
author	Zhu, Runchuan; Jiang, Zinco; Wu, Jiang; Ma, Zhipeng; Song, Jiahe; Bai, Fengshuo; Lin, Dahua; Wu, Lijun; He, Conghui
author_facet	Zhu, Runchuan; Jiang, Zinco; Wu, Jiang; Ma, Zhipeng; Song, Jiahe; Bai, Fengshuo; Lin, Dahua; Wu, Lijun; He, Conghui
contents	Refusal-Aware Instruction Tuning (RAIT) aims to enhance Large Language Models (LLMs) by improving their ability to refuse to answer questions beyond their knowledge, thereby reducing hallucinations and improving reliability. Effective RAIT must address two key challenges: first, rejecting unknown questions effectively to minimize hallucinations; second, avoiding over-refusal so that questions that can be answered correctly are not rejected, thereby maintaining the helpfulness of LLM outputs. In this paper, we address both challenges by deriving observations from a gradient-based perspective and proposing the Gradient-driven Refusal-Aware Instruction Tuning framework (GRAIT), which (1) employs gradient-driven sample selection to minimize hallucinations and (2) introduces an adaptive weighting mechanism during fine-tuning to reduce the risk of over-refusal, balancing accurate refusals against useful responses. Experimental evaluations on open-ended and multiple-choice question answering tasks demonstrate that GRAIT significantly outperforms existing RAIT methods in overall performance. The source code and data will be available at https://github.com/opendatalab/GRAIT .
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_05911
institution	arXiv
publishDate	2025
record_format	arxiv
title	GRAIT: Gradient-Driven Refusal-Aware Instruction Tuning for Effective Hallucination Mitigation
topic	Computation and Language
url	https://arxiv.org/abs/2502.05911

Similar Items