| Main Authors: | Lermen, Simon; Dziemian, Mateusz; Antolín, Natalia Pérez-Campanero |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2504.07831 |
Similar Items
Applying Refusal-Vector Ablation to Llama 3.1 70B Agents
by: Lermen, Simon, et al.
Published: (2024)
Too Big to Fool: Resisting Deception in Language Models
by: Samsami, Mohammad Reza, et al.
Published: (2024)
CryptoFormalEval: Integrating LLMs and Formal Verification for Automated Cryptographic Protocol Vulnerability Detection
by: Curaba, Cristian, et al.
Published: (2024)
Identifying Cooperative Personalities in Multi-agent Contexts through Personality Steering with Representation Engineering
by: Ong, Kenneth J. K., et al.
Published: (2025)
Leveraging Large Language Models for Building Interpretable Rule-Based Data-to-Text Systems
by: Warczyński, Jędrzej, et al.
Published: (2025)
Language Model Re-rankers are Fooled by Lexical Similarities
by: Hagström, Lovisa, et al.
Published: (2025)
To Tell The Truth: Language of Deception and Language Models
by: Hazra, Sanchaita, et al.
Published: (2023)
Emergent Languages in Populations of Language Model Agents: From Token Efficiency to Oversight Evasion
by: Beltoft, Stine Lyngsø, et al.
Published: (2026)
LLM Agents Implement an NLG System from Scratch: Building Interpretable Rule-Based RDF-to-Text Generators
by: Lango, Mateusz, et al.
Published: (2025)
PHANTOM RECALL: When Familiar Puzzles Fool Smart Models
by: Mukhopadhyay, Souradeep, et al.
Published: (2025)
LieCraft: A Multi-Agent Framework for Evaluating Deceptive Capabilities in Language Models
by: Olson, Matthew Lyle, et al.
Published: (2026)
Stress-testing Machine Generated Text Detection: Shifting Language Models Writing Style to Fool Detectors
by: Pedrotti, Andrea, et al.
Published: (2025)
An Assessment of Model-On-Model Deception
by: Heitkoetter, Julius, et al.
Published: (2024)
Seamless Deception: Larger Language Models Are Better Knowledge Concealers
by: Ashok, Dhananjay, et al.
Published: (2026)
Deception Abilities Emerged in Large Language Models
by: Hagendorff, Thilo
Published: (2023)
Fooling the Textual Fooler via Randomizing Latent Representations
by: Hoang, Duy C., et al.
Published: (2023)
Automated Interpretability and Feature Discovery in Language Models with Agents
by: Marin-Llobet, Arnau, et al.
Published: (2026)
Unmasking the Shadows of AI: Investigating Deceptive Capabilities in Large Language Models
by: Guo, Linge
Published: (2024)
The Point of No Return: Counterfactual Localization of Deceptive Commitment in Language-Model Reasoning
by: Merrill, Scott, et al.
Published: (2026)
Compromising Honesty and Harmlessness in Language Models via Deception Attacks
by: Vaugrante, Laurène, et al.
Published: (2025)
Language Models Optimized to Fool Detectors Still Have a Distinct Style (And How to Change It)
by: Soto, Rafael Rivera, et al.
Published: (2025)
From Deception to Detection: The Dual Roles of Large Language Models in Fake News
by: Sallami, Dorsaf, et al.
Published: (2024)
Can AI Models be Jailbroken to Phish Elderly Victims? An End-to-End Evaluation
by: Heiding, Fred, et al.
Published: (2025)
Great Models Think Alike and this Undermines AI Oversight
by: Goel, Shashwat, et al.
Published: (2025)
Unmasking Deceptive Visuals: Benchmarking Multimodal Large Language Models on Misleading Chart Question Answering
by: Chen, Zixin, et al.
Published: (2025)
Beyond Lines and Circles: Unveiling the Geometric Reasoning Gap in Large Language Models
by: Mouselinos, Spyridon, et al.
Published: (2024)
On the Importance and Evaluation of Narrativity in Natural Language AI Explanations
by: Cedro, Mateusz, et al.
Published: (2026)
Exploitation Without Deception: Dark Triad Feature Steering Reveals Separable Antisocial Circuits in Language Models
by: Berg, Cameron, et al.
Published: (2026)
OpenDeception: Learning Deception and Trust in Human-AI Interaction via Multi-Agent Simulation
by: Wu, Yichen, et al.
Published: (2025)
Evaluating & Reducing Deceptive Dialogue From Language Models with Multi-turn RL
by: Abdulhai, Marwa, et al.
Published: (2025)
Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant
by: Järviniemi, Olli, et al.
Published: (2024)
SequentialBreak: Large Language Models Can be Fooled by Embedding Jailbreak Prompts into Sequential Prompt Chains
by: Saiem, Bijoy Ahmed, et al.
Published: (2024)
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents
by: Andriushchenko, Maksym, et al.
Published: (2024)
Towards Scalable Oversight via Partitioned Human Supervision
by: Yin, Ren, et al.
Published: (2025)
Dagger Behind Smile: Fool LLMs with a Happy Ending Story
by: Song, Xurui, et al.
Published: (2025)
Building a Precise Video Language with Human-AI Oversight
by: Lin, Zhiqiu, et al.
Published: (2026)
Preservation of Language Understanding Capabilities in Speech-aware Large Language Models
by: Kubis, Marek, et al.
Published: (2025)
The Steganographic Potentials of Language Models
by: Karpov, Artem, et al.
Published: (2025)
Click it or Leave it: Detecting and Spoiling Clickbait with Informativeness Measures and Large Language Models
by: Michaluk, Wojciech, et al.
Published: (2026)
NYT-Connections: A Deceptively Simple Text Classification Task that Stumps System-1 Thinkers
by: Lopez, Angel Yahir Loredo, et al.
Published: (2024)