Bibliographic Details
Main Authors: Ji, Xu, Zhang, Jianyi, Zhou, Ziyin, Zhao, Zhangchi, Qiao, Qianqian, Han, Kaiying, Hossen, Md Imran, Hei, Xiali
Format: Preprint
Published: 2024
Subjects:
Online Access: https://arxiv.org/abs/2405.00718
author Ji, Xu
Zhang, Jianyi
Zhou, Ziyin
Zhao, Zhangchi
Qiao, Qianqian
Han, Kaiying
Hossen, Md Imran
Hei, Xiali
contents Ensuring the resilience of Large Language Models (LLMs) against malicious exploitation is paramount, with recent work focusing on mitigating offensive responses. Yet LLMs' understanding of cant, or dark jargon, remains unexplored. This paper introduces a domain-specific Cant dataset and the CantCounter evaluation framework, which employs Fine-Tuning, Co-Tuning, Data-Diffusion, and Data-Analysis stages. Experiments reveal that LLMs, including ChatGPT, are susceptible to cant bypassing their filters, with recognition accuracy varying by question type, setup, and prompt clues. Updated models exhibit higher acceptance rates for cant queries. Moreover, LLM reactions differ across domains, e.g., in their reluctance to engage with racism versus LGBT topics. These findings underscore LLMs' understanding of cant and reflect training data characteristics and vendor approaches to sensitive topics. Additionally, we assess LLMs' ability to demonstrate reasoning capabilities. Our datasets and code are available at https://github.com/cistineup/CantCounter.
format Preprint
id arxiv_https___arxiv_org_abs_2405_00718
institution arXiv
publishDate 2024
record_format arxiv
title Can't say cant? Measuring and Reasoning of Dark Jargons in Large Language Models
topic Computation and Language
Artificial Intelligence
url https://arxiv.org/abs/2405.00718