Bibliographic Details
Main Authors: Shen, Jocelyn; Luvsanchultem, Amina; Kim, Jessica; Smith, Kynnedy; Danry, Valdemar; Rogers, Kantwon; Park, Hae Won; Sap, Maarten; Breazeal, Cynthia
Format: Preprint
Published: 2026
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2603.20907
_version_ 1866911547891122176
author Shen, Jocelyn
Luvsanchultem, Amina
Kim, Jessica
Smith, Kynnedy
Danry, Valdemar
Rogers, Kantwon
Park, Hae Won
Sap, Maarten
Breazeal, Cynthia
contents As users increasingly turn to LLMs for practical and personal advice, they become vulnerable to subtle steering toward hidden incentives misaligned with their own interests. While existing NLP research has benchmarked manipulation detection, these efforts often rely on simulated debates and remain fundamentally decoupled from actual human belief shifts in real-world scenarios. We introduce PUPPET, a theoretical taxonomy and resource that bridges this gap by focusing on the moral direction of hidden incentives in everyday, advice-giving contexts. We provide an evaluation dataset of N=1,035 human-LLM interactions, where we measure users' belief shifts. Our analysis reveals a critical disconnect in current safety paradigms: while models can be trained to detect manipulative strategies, these detections do not correlate with the magnitude of resulting belief change. As such, we define the task of belief shift prediction and show that while state-of-the-art LLMs achieve moderate correlation (r=0.3-0.5), they systematically underestimate the intensity of human belief susceptibility. This work establishes a theoretically grounded and behaviorally validated foundation for AI social safety efforts by studying incentive-driven manipulation in LLMs during everyday, practical user queries.
format Preprint
id arxiv_https___arxiv_org_abs_2603_20907
institution arXiv
publishDate 2026
record_format arxiv
title The Hidden Puppet Master: Predicting Human Belief Change in Manipulative LLM Dialogues
topic Computation and Language
url https://arxiv.org/abs/2603.20907