Staff View: CodeCipher: Learning to Obfuscate Source Code Against LLMs :: Library Catalog

Bibliographic Details
Main Authors:	Lin, Yalan; Wan, Chengcheng; Fang, Yixiong; Gu, Xiaodong
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2410.05797

_version_	1866916427742576640
author	Lin, Yalan Wan, Chengcheng Fang, Yixiong Gu, Xiaodong
author_facet	Lin, Yalan Wan, Chengcheng Fang, Yixiong Gu, Xiaodong
contents	While large code language models have made significant strides in AI-assisted coding tasks, there are growing concerns about privacy. User code is fully visible to the cloud LLM service provider, which creates risks of unauthorized training on, reading of, and execution of that code. In this paper, we propose CodeCipher, a novel method that obfuscates private information in code while preserving the original response from LLMs. CodeCipher transforms the LLM's embedding matrix so that each row corresponds to a different word in the original matrix, forming a token-to-token confusion mapping for obfuscating source code. The new embedding matrix is optimized by minimizing the task-specific loss function. To tackle the challenge of the discrete and sparse nature of word vector spaces, CodeCipher adopts a discrete optimization strategy that aligns the updated vector to the nearest valid token in the vocabulary before each gradient update. We demonstrate the effectiveness of our approach on three AI-assisted coding tasks: code completion, summarization, and translation. Results show that our model successfully obscures private information in source code while preserving the original LLM's performance.
format	Preprint
id	arxiv_https___arxiv_org_abs_2410_05797
institution	arXiv
publishDate	2024
record_format	arxiv
title	CodeCipher: Learning to Obfuscate Source Code Against LLMs
topic	Computation and Language
url	https://arxiv.org/abs/2410.05797
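
The abstract in the contents field describes aligning an updated embedding vector to the nearest valid token in the vocabulary, which yields a token-to-token confusion mapping. The following is a minimal sketch of that projection step only, not the authors' implementation. The function names, the toy embedding, and the Gaussian perturbation (a stand-in for a gradient update on the task-specific loss) are all assumptions made for illustration. Unlike the method described, this sketch does not force each token to map to a different token.

```python
import random


def nearest_token(vector, embedding):
    """Return the index of the embedding row closest to `vector`
    (squared Euclidean distance)."""
    def sq_dist(row):
        return sum((a - b) ** 2 for a, b in zip(row, vector))
    return min(range(len(embedding)), key=lambda i: sq_dist(embedding[i]))


def build_confusion_mapping(embedding, noise=0.5, seed=0):
    """Perturb every row, then snap it back to the nearest valid token.

    Assumption: the random perturbation stands in for a gradient update;
    a real system would update the vectors by minimizing a task loss.
    """
    rng = random.Random(seed)
    mapping = {}
    for token_id, row in enumerate(embedding):
        perturbed = [x + rng.gauss(0.0, noise) for x in row]
        mapping[token_id] = nearest_token(perturbed, embedding)
    return mapping


def obfuscate(token_ids, mapping):
    """Rewrite a token sequence through the token-to-token mapping."""
    return [mapping[t] for t in token_ids]


# Hypothetical 4-token vocabulary with 2-dimensional embeddings.
toy_embedding = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mapping = build_confusion_mapping(toy_embedding)
obfuscated = obfuscate([0, 1, 2, 3], mapping)
```

Snapping to the nearest row keeps every obfuscated position a valid vocabulary token, which is what lets the result be sent to the model as ordinary text.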

Similar Items