Staff View: Towards Transfer Unlearning: Empirical Evidence of Cross-Domain Bias Mitigation :: Library Catalog

Bibliographic Details
Main Authors:	Lu, Huimin; Isonuma, Masaru; Mori, Junichiro; Sakata, Ichiro
Format:	Preprint
Published:	2024
Subjects:	Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2407.16951

_version_	1866914884923424768
author	Lu, Huimin; Isonuma, Masaru; Mori, Junichiro; Sakata, Ichiro
author_facet	Lu, Huimin; Isonuma, Masaru; Mori, Junichiro; Sakata, Ichiro
contents	Large language models (LLMs) often inherit biases from vast amounts of training corpora. Traditional debiasing methods, while effective to some extent, do not completely eliminate memorized biases and toxicity in LLMs. In this paper, we study an unlearning-based approach to debiasing in LLMs by performing gradient ascent on hate speech against minority groups, i.e., minimizing the likelihood of biased or toxic content. Specifically, we propose a masked language modeling unlearning technique, which unlearns the harmful part of the text. This method enables LLMs to selectively forget and disassociate from biased and harmful content. Experimental results demonstrate the effectiveness of our approach in diminishing bias while maintaining language modeling abilities. Surprisingly, the results also unveil an unexpected potential for cross-domain transfer unlearning: debiasing in one bias form (e.g., gender) may contribute to mitigating others (e.g., race and religion).
format	Preprint
id	arxiv_https___arxiv_org_abs_2407_16951
institution	arXiv
publishDate	2024
record_format	arxiv
title	Towards Transfer Unlearning: Empirical Evidence of Cross-Domain Bias Mitigation
topic	Computation and Language; Machine Learning
url	https://arxiv.org/abs/2407.16951

Similar Items