Staff View: I Can Find You in Seconds! Leveraging Large Language Models for Code Authorship Attribution :: Library Catalog

Bibliographic Details
Main Authors:	Choi, Soohyeon; Tan, Yong Kiam; Meng, Mark Huasong; Ragab, Mohamed; Mondal, Soumik; Mohaisen, David; Aung, Khin Mi Mi
Format:	Preprint
Published:	2025
Subjects:	Software Engineering; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2501.08165

_version_	1866909456503144448
author	Choi, Soohyeon; Tan, Yong Kiam; Meng, Mark Huasong; Ragab, Mohamed; Mondal, Soumik; Mohaisen, David; Aung, Khin Mi Mi
author_facet	Choi, Soohyeon; Tan, Yong Kiam; Meng, Mark Huasong; Ragab, Mohamed; Mondal, Soumik; Mohaisen, David; Aung, Khin Mi Mi
contents	Source code authorship attribution is important in software forensics, plagiarism detection, and protecting software patch integrity. Existing techniques often rely on supervised machine learning, which struggles with generalization across different programming languages and coding styles due to the need for large labeled datasets. Inspired by recent advances in natural language authorship analysis using large language models (LLMs), which have shown exceptional performance without task-specific tuning, this paper explores the use of LLMs for source code authorship attribution. We present a comprehensive study demonstrating that state-of-the-art LLMs can successfully attribute source code authorship across different languages. LLMs can determine whether two code snippets are written by the same author with zero-shot prompting, achieving a Matthews Correlation Coefficient (MCC) of 0.78, and can attribute code authorship from a small set of reference code snippets via few-shot learning, achieving MCC of 0.77. Additionally, LLMs show some adversarial robustness against misattribution attacks. Despite these capabilities, we found that naive prompting of LLMs does not scale well with a large number of authors due to input token limitations. To address this, we propose a tournament-style approach for large-scale attribution. Evaluating this approach on datasets of C++ (500 authors, 26,355 samples) and Java (686 authors, 55,267 samples) code from GitHub, we achieve classification accuracy of up to 65% for C++ and 68.7% for Java using only one reference per author. These results open new possibilities for applying LLMs to code authorship attribution in cybersecurity and software engineering.
format	Preprint
id	arxiv_https___arxiv_org_abs_2501_08165
institution	arXiv
publishDate	2025
record_format	arxiv
title	I Can Find You in Seconds! Leveraging Large Language Models for Code Authorship Attribution
topic	Software Engineering; Artificial Intelligence
url	https://arxiv.org/abs/2501.08165

Similar Items