Staff View: Understanding Secret Leakage Risks in Code LLMs: A Tokenization Perspective :: Library Catalog

Bibliographic Details
Main Authors:	Chen, Meifang; Yang, Zhe; Nianchen, Huang; Huang, Yizhan; Li, Yichen; Li, Zihan; Lyu, Michael R.
Format:	Preprint
Published:	2026
Subjects:	Cryptography and Security; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2604.17814

_version_	1866908978618826752
author	Chen, Meifang; Yang, Zhe; Nianchen, Huang; Huang, Yizhan; Li, Yichen; Li, Zihan; Lyu, Michael R.
author_facet	Chen, Meifang; Yang, Zhe; Nianchen, Huang; Huang, Yizhan; Li, Yichen; Li, Zihan; Lyu, Michael R.
contents	Code secrets are sensitive assets for software developers, and their leakage poses significant cybersecurity risks. With the rapid development of AI code assistants powered by Code Large Language Models (CLLMs), CLLMs have been shown to inadvertently leak such secrets due to a notorious memorization phenomenon. This study first reveals that Byte-Pair Encoding (BPE) tokenization leads to unexpected secret-memorization behavior, which we term gibberish bias. Specifically, we find that some secrets are among the easiest for CLLMs to memorize: these secrets have high character-level entropy but low token-level entropy. The paper then supports this claim of bias with numerical data. We identify the root of the bias as the token distribution shift between the CLLM training data and the secret data. We further discuss how gibberish bias manifests under the "larger vocabulary" trend. To conclude, we discuss potential mitigation strategies and the broader implications for current tokenizer design.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_17814
institution	arXiv
publishDate	2026
record_format	arxiv
title	Understanding Secret Leakage Risks in Code LLMs: A Tokenization Perspective
topic	Cryptography and Security; Artificial Intelligence
url	https://arxiv.org/abs/2604.17814

Similar Items