Staff View: WPN: An Unlearning Method Based on N-pair Contrastive Learning in Language Models :: Library Catalog

Bibliographic Details
Main Authors:	Chen, Guitao; Wang, Yunshen; Sun, Hongye; Chen, Guang
Format:	Preprint
Published:	2024
Subjects:	Computation and Language; Information Retrieval
Online Access:	https://arxiv.org/abs/2408.09459

_version_	1866917752014372864
author	Chen, Guitao; Wang, Yunshen; Sun, Hongye; Chen, Guang
author_facet	Chen, Guitao; Wang, Yunshen; Sun, Hongye; Chen, Guang
contents	Generative language models (LMs) offer numerous advantages but may produce inappropriate or harmful outputs due to the harmful knowledge acquired during pre-training. This knowledge often manifests as undesirable correspondences, such as "harmful prompts" leading to "harmful outputs," which our research aims to mitigate through unlearning techniques. However, existing unlearning methods based on gradient ascent can significantly impair the performance of LMs. To address this issue, we propose a novel approach called Weighted Positional N-pair (WPN) Learning, which leverages position-weighted mean pooling within an n-pair contrastive learning framework. WPN is designed to modify the output distribution of LMs by eliminating specific harmful outputs (e.g., replacing toxic responses with neutral ones), thereby transforming the model's behavior from "harmful prompt-harmful output" to "harmful prompt-harmless response". Experiments on OPT and GPT-NEO LMs show that WPN effectively reduces the proportion of harmful responses, achieving a harmless rate of up to 95.8% while maintaining stable performance on nine common benchmarks (with less than 2% degradation on average). Moreover, we provide empirical evidence to demonstrate WPN's ability to weaken the harmful correspondences in terms of generalizability and robustness, as evaluated on out-of-distribution test sets and under adversarial attacks.
format	Preprint
id	arxiv_https___arxiv_org_abs_2408_09459
institution	arXiv
publishDate	2024
record_format	arxiv
title	WPN: An Unlearning Method Based on N-pair Contrastive Learning in Language Models
topic	Computation and Language; Information Retrieval
url	https://arxiv.org/abs/2408.09459

Similar Items