Staff View: LLM Wardens: Mitigating Adversarial Persuasion with Third-Party Conversational Oversight :: Library Catalog

Bibliographic Details
Main Authors:	Wachowiak, Lennart; Blain, Scott D.; Williams-King, David; Marro, Samuele
Format:	Preprint
Published:	2026
Subjects:	Machine Learning; Artificial Intelligence; Computers and Society; Human-Computer Interaction; Multiagent Systems; I.2.0; I.2.7; J.4; H.5; K.4.2
Online Access:	https://arxiv.org/abs/2605.08321

_version_	1866915995473412096
author	Wachowiak, Lennart; Blain, Scott D.; Williams-King, David; Marro, Samuele
author_facet	Wachowiak, Lennart; Blain, Scott D.; Williams-King, David; Marro, Samuele
contents	LLMs are increasingly capable of persuasion, which raises the question of how to protect users against manipulation. In a preregistered user study (N=120) across four decision-making scenarios, we find that an adversarial LLM with a hidden goal succeeds in steering users' decisions 65.4% of the time. We then introduce a "warden" model: a secondary LLM that monitors the human-AI interaction trace in real time and issues non-binding, private advisories to the user when it detects manipulation. Adding a warden more than halves the adversary's success rate to 30.4%, with a much smaller (8.6 percentage points) reduction for genuine interactions. To probe the mechanism behind these results, we release COAX-Bench, a simulation benchmark spanning 14 decision-making scenarios, including hiring, voting, and file access. Across 16,212 simulated multi-agent interactions, capable adversarial LLMs achieve their hidden goals in 34.7% of cases, which warden models reduce to 12.3%. Notably, even warden models substantially weaker than the adversary they oversee provide meaningful protection, suggesting a path for scalable oversight of more capable models.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_08321
institution	arXiv
publishDate	2026
record_format	arxiv
title	LLM Wardens: Mitigating Adversarial Persuasion with Third-Party Conversational Oversight
topic	Machine Learning; Artificial Intelligence; Computers and Society; Human-Computer Interaction; Multiagent Systems; I.2.0; I.2.7; J.4; H.5; K.4.2
url	https://arxiv.org/abs/2605.08321