Staff View: Leveraging Large Language Models to Detect npm Malicious Packages :: Library Catalog

Bibliographic Details
Main Authors:	Zahan, Nusrat; Burckhardt, Philipp; Lysenko, Mikola; Aboukhadijeh, Feross; Williams, Laurie
Format:	Preprint
Published:	2024
Subjects:	Cryptography and Security; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2403.12196

_version_	1866916551541653504
author	Zahan, Nusrat; Burckhardt, Philipp; Lysenko, Mikola; Aboukhadijeh, Feross; Williams, Laurie
author_facet	Zahan, Nusrat; Burckhardt, Philipp; Lysenko, Mikola; Aboukhadijeh, Feross; Williams, Laurie
contents	Existing malicious code detection techniques demand the integration of multiple tools to detect different malware patterns, often suffering from high misclassification rates. Therefore, malicious code detection techniques could be enhanced by adopting advanced, more automated approaches to achieve high accuracy and a low misclassification rate. The goal of this study is to aid security analysts in detecting malicious packages by empirically studying the effectiveness of Large Language Models (LLMs) in detecting malicious code. We present SocketAI, a malicious code review workflow to detect malicious code. To evaluate the effectiveness of SocketAI, we leverage a benchmark dataset of 5,115 npm packages, of which 2,180 packages have malicious code. We conducted a baseline comparison of GPT-3 and GPT-4 models with the state-of-the-art CodeQL static analysis tool, using 39 custom CodeQL rules developed in prior research to detect malicious JavaScript code. We also compare the effectiveness of static analysis as a pre-screener with the SocketAI workflow, measuring the number of files that need to be analyzed and the associated costs. Additionally, we performed a qualitative study to understand the types of malicious activities detected or missed by our workflow. Our baseline comparison demonstrates a 16% and 9% improvement over static analysis in precision and F1 scores, respectively. GPT-4 achieves higher accuracy with 99% precision and 97% F1 scores, while GPT-3 offers a more cost-effective balance at 91% precision and 94% F1 scores. Pre-screening files with a static analyzer reduces the number of files requiring LLM analysis by 77.9% and decreases costs by 60.9% for GPT-3 and 76.1% for GPT-4. Our qualitative analysis identified data theft, execution of arbitrary code, and suspicious domain categories as the top detected malicious packages.
format	Preprint
id	arxiv_https___arxiv_org_abs_2403_12196
institution	arXiv
publishDate	2024
record_format	arxiv
title	Leveraging Large Language Models to Detect npm Malicious Packages
topic	Cryptography and Security; Artificial Intelligence
url	https://arxiv.org/abs/2403.12196
