Staff View :: Library Catalog

Bibliographic Details
Main Authors:	Shen, Jocelyn; Luvsanchultem, Amina; Kim, Jessica; Smith, Kynnedy; Danry, Valdemar; Rogers, Kantwon; Park, Hae Won; Sap, Maarten; Breazeal, Cynthia
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2603.20907

_version_	1866911547891122176
author	Shen, Jocelyn; Luvsanchultem, Amina; Kim, Jessica; Smith, Kynnedy; Danry, Valdemar; Rogers, Kantwon; Park, Hae Won; Sap, Maarten; Breazeal, Cynthia
author_facet	Shen, Jocelyn; Luvsanchultem, Amina; Kim, Jessica; Smith, Kynnedy; Danry, Valdemar; Rogers, Kantwon; Park, Hae Won; Sap, Maarten; Breazeal, Cynthia
contents	As users increasingly turn to LLMs for practical and personal advice, they become vulnerable to subtle steering toward hidden incentives misaligned with their own interests. While existing NLP research has benchmarked manipulation detection, these efforts often rely on simulated debates and remain fundamentally decoupled from actual human belief shifts in real-world scenarios. We introduce PUPPET, a theoretical taxonomy and resource that bridges this gap by focusing on the moral direction of hidden incentives in everyday, advice-giving contexts. We provide an evaluation dataset of N=1,035 human-LLM interactions, where we measure users' belief shifts. Our analysis reveals a critical disconnect in current safety paradigms: while models can be trained to detect manipulative strategies, they do not correlate with the magnitude of resulting belief change. As such, we define the task of belief shift prediction and show that while state-of-the-art LLMs achieve moderate correlation (r=0.3-0.5), they systematically underestimate the intensity of human belief susceptibility. This work establishes a theoretically grounded and behaviorally validated foundation for AI social safety efforts by studying incentive-driven manipulation in LLMs during everyday, practical user queries.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_20907
institution	arXiv
publishDate	2026
record_format	arxiv
title	The Hidden Puppet Master: Predicting Human Belief Change in Manipulative LLM Dialogues
topic	Computation and Language
url	https://arxiv.org/abs/2603.20907

Similar Items