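Staff View: The Devil is in the Neurons: Interpreting and Mitigating Social Biases in Pre-trained Language Models :: Library Catalog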

Bibliographic Details
Main Authors:	Liu, Yan; Liu, Yu; Chen, Xiaokang; Chen, Pin-Yu; Zan, Daoguang; Kan, Min-Yen; Ho, Tsung-Yi
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2406.10130

_version_	1866916287447302144
author	Liu, Yan; Liu, Yu; Chen, Xiaokang; Chen, Pin-Yu; Zan, Daoguang; Kan, Min-Yen; Ho, Tsung-Yi
author_facet	Liu, Yan; Liu, Yu; Chen, Xiaokang; Chen, Pin-Yu; Zan, Daoguang; Kan, Min-Yen; Ho, Tsung-Yi
contents	Pre-trained language models (PLMs) have been acknowledged to contain harmful information, such as social biases, which may cause negative social impacts or even bring catastrophic results in application. Previous work on this problem has mainly focused on using black-box methods such as probing to detect and quantify social biases in PLMs by observing model outputs. As a result, previous debiasing methods mainly fine-tune or even pre-train language models on newly constructed anti-stereotypical datasets, which is costly. In this work, we try to unveil the mystery of social bias inside language models by introducing the concept of Social Bias Neurons. Specifically, we propose Integrated Gap Gradients (IG²) to accurately pinpoint units (i.e., neurons) in a language model that can be attributed to undesirable behavior, such as social bias. By formalizing undesirable behavior as a distributional property of language, we employ sentiment-bearing prompts to elicit classes of sensitive words (demographics) correlated with such sentiments. Our IG² thus attributes the uneven distribution for different demographics to specific Social Bias Neurons, which track the trail of unwanted behavior inside PLM units to achieve interpretability. Moreover, derived from our interpretable technique, Bias Neuron Suppression (BNS) is further proposed to mitigate social biases. By studying BERT, RoBERTa, and their attributable differences from debiased FairBERTa, IG² allows us to locate and suppress identified neurons, and further mitigate undesired behaviors. As measured by prior metrics from StereoSet, our model achieves a higher degree of fairness while maintaining language modeling ability at low cost.
format	Preprint
id	arxiv_https___arxiv_org_abs_2406_10130
institution	arXiv
publishDate	2024
record_format	arxiv
title	The Devil is in the Neurons: Interpreting and Mitigating Social Biases in Pre-trained Language Models
topic	Computation and Language
url	https://arxiv.org/abs/2406.10130

Similar Items