Staff View: FragBench: Cross-Session Attacks Hidden in Benign-Looking Fragments :: Library Catalog

Bibliographic Details
Main Authors:	Mehta, Astha; Selvanayagam, Niruthiha; Lam, Cedric; Li, Hengxu; Nguyen, Phuc-Nguyen; Lee, Raymond; McGoffin, Olivia; My, Luong; Collé, Arthur; Johnson, Jamie; Williams-King, David; Le, Linh
Format:	Preprint
Published:	2026
Subjects:	Cryptography and Security; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2605.11029

_version_	1866918494868602880
author	Mehta, Astha Selvanayagam, Niruthiha Lam, Cedric Li, Hengxu Nguyen, Phuc-Nguyen Lee, Raymond McGoffin, Olivia My Luong Collé, Arthur Johnson, Jamie Williams-King, David Le, Linh
author_facet	Mehta, Astha Selvanayagam, Niruthiha Lam, Cedric Li, Hengxu Nguyen, Phuc-Nguyen Lee, Raymond McGoffin, Olivia My Luong Collé, Arthur Johnson, Jamie Williams-King, David Le, Linh
contents	An attacker can split a malicious goal into sub-prompts that each look benign on their own and only become harmful in combination. Existing LLM safety benchmarks evaluate prompts one at a time, or across turns of a single chat, and so do not look for a malicious signal spread across separate sessions with no shared context. We build FragBench, a benchmark drawn from 24 real-world cyber-incident campaigns, which keeps the full attack trail: the multi-fragment kill chain, the per-fragment safety-judge verdicts, sandboxed execution traces, and a matched set of benign cover sessions. FragBench splits this trail into two paired tasks: an adversarial rewriter that hardens fragments against a single-turn safety judge (FragBench Attack), and a graph-based user-level detector trained on the resulting interactions (FragBench Defense). The single-turn judge is near chance on the released corpus by construction, but four GNN variants and three classical-ML baselines all recover the cross-session feature, reaching aggregate event-level F1 = 0.88-0.96. Defending against fragmented LLM misuse therefore requires modeling the cross-session interaction graph, rather than isolated prompts. Our generator, rewriter, sandbox harness, and detector are released at https://github.com/LidaSafety/fragbench.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_11029
institution	arXiv
publishDate	2026
record_format	arxiv
title	FragBench: Cross-Session Attacks Hidden in Benign-Looking Fragments
topic	Cryptography and Security; Artificial Intelligence
url	https://arxiv.org/abs/2605.11029
