Staff View: Good News for Script Kiddies? Evaluating Large Language Models for Automated Exploit Generation :: Library Catalog

Bibliographic Details
Main Authors:	Jin, David; Fu, Qian; Li, Yuekang
Format:	Preprint
Published:	2025
Subjects:	Cryptography and Security; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2505.01065

_version_	1866916717065666560
author	Jin, David; Fu, Qian; Li, Yuekang
author_facet	Jin, David; Fu, Qian; Li, Yuekang
contents	Large Language Models (LLMs) have demonstrated remarkable capabilities in code-related tasks, raising concerns about their potential for automated exploit generation (AEG). This paper presents the first systematic study on LLMs' effectiveness in AEG, evaluating both their cooperativeness and technical proficiency. To mitigate dataset bias, we introduce a benchmark with refactored versions of five software security labs. Additionally, we design an LLM-based attacker to systematically prompt LLMs for exploit generation. Our experiments reveal that GPT-4 and GPT-4o exhibit high cooperativeness, comparable to uncensored models, while Llama3 is the most resistant. However, no model successfully generates exploits for refactored labs, though GPT-4o's minimal errors highlight the potential for LLM-driven AEG advancements.
format	Preprint
id	arxiv_https___arxiv_org_abs_2505_01065
institution	arXiv
publishDate	2025
record_format	arxiv
title	Good News for Script Kiddies? Evaluating Large Language Models for Automated Exploit Generation
topic	Cryptography and Security; Artificial Intelligence
url	https://arxiv.org/abs/2505.01065

Similar Items