Staff View: :: Library Catalog


Bibliographic Details
Main Authors:	Talokar, Nivya; Tarun, Ayush K; Mandal, Murari; Andriushchenko, Maksym; Bosselut, Antoine
Format:	Preprint
Published:	2026
Subjects:	Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2602.16346

_version_	1866911690051813376
author	Talokar, Nivya; Tarun, Ayush K; Mandal, Murari; Andriushchenko, Maksym; Bosselut, Antoine
author_facet	Talokar, Nivya; Tarun, Ayush K; Mandal, Murari; Andriushchenko, Maksym; Bosselut, Antoine
contents	LLM-based agents execute real-world workflows via tools and memory. These affordances enable ill-intended adversaries to also use these agents to carry out complex misuse scenarios. Existing agent misuse benchmarks largely test single-prompt instructions, leaving a gap in measuring how agents end up helping with harmful or illegal tasks over multiple turns. We introduce STING (Sequential Testing of Illicit N-step Goal execution), an automated red-teaming framework that constructs a step-by-step illicit plan grounded in a benign persona and iteratively probes a target agent with adaptive follow-ups, using judge agents to track phase completion. We further introduce an analysis framework that models multi-turn red-teaming as a time-to-first-jailbreak random variable, enabling analysis tools like discovery curves, hazard-ratio attribution by attack language, and a new metric: Restricted Mean Jailbreak Discovery. Across AgentHarm scenarios, STING yields substantially higher illicit-task completion than single-turn prompting and chat-oriented multi-turn baselines adapted to tool-using agents. In multilingual evaluations across six non-English settings, we find that attack success and illicit-task completion do not consistently increase in lower-resource languages, diverging from common chatbot findings. Overall, STING provides a practical way to evaluate and stress-test agent misuse in realistic deployment settings, where interactions are inherently multi-turn and often multilingual.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_16346
institution	arXiv
publishDate	2026
record_format	arxiv
title	Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents
topic	Computation and Language; Machine Learning
url	https://arxiv.org/abs/2602.16346
