MARC View: :: Library Catalog

Bibliographic Details
Main Authors:	Cai, Zikui; Shabihi, Shayan; An, Bang; Che, Zora; Bartoldson, Brian R.; Kailkhura, Bhavya; Goldstein, Tom; Huang, Furong
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2504.20965

_version_	1866909648908451840
author	Cai, Zikui; Shabihi, Shayan; An, Bang; Che, Zora; Bartoldson, Brian R.; Kailkhura, Bhavya; Goldstein, Tom; Huang, Furong
author_facet	Cai, Zikui; Shabihi, Shayan; An, Bang; Che, Zora; Bartoldson, Brian R.; Kailkhura, Bhavya; Goldstein, Tom; Huang, Furong
contents	We introduce AegisLLM, a cooperative multi-agent defense against adversarial attacks and information leakage. In AegisLLM, autonomous agents (orchestrator, deflector, responder, and evaluator) collaborate in a structured workflow to ensure safe and compliant LLM outputs, while self-improving over time through prompt optimization. We show that scaling an agentic reasoning system at test time, both by incorporating additional agent roles and by leveraging automated prompt optimization (such as DSPy), substantially enhances robustness without compromising model utility. This test-time defense enables real-time adaptability to evolving attacks without requiring model retraining. Comprehensive evaluations across key threat scenarios, including unlearning and jailbreaking, demonstrate the effectiveness of AegisLLM. On the WMDP unlearning benchmark, AegisLLM achieves near-perfect unlearning with only 20 training examples and fewer than 300 LM calls. On jailbreaking benchmarks, we achieve a 51% improvement over the base model on StrongReject, with a false refusal rate of only 7.9% on PHTest compared to 18-55% for comparable methods. Our results highlight the advantages of adaptive, agentic reasoning over static defenses, establishing AegisLLM as a strong runtime alternative to traditional approaches based on model modifications. Code is available at https://github.com/zikuicai/aegisllm
format	Preprint
id	arxiv_https___arxiv_org_abs_2504_20965
institution	arXiv
publishDate	2025
record_format	arxiv
title	AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security
topic	Machine Learning
url	https://arxiv.org/abs/2504.20965