:: Library Catalog

Bibliographic Details
Main Authors:	Lermen, Simon; Dziemian, Mateusz; Antolín, Natalia Pérez-Campanero
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence; Computation and Language
Online Access:	https://arxiv.org/abs/2504.07831

Similar Items

Applying Refusal-Vector Ablation to Llama 3.1 70B Agents
by: Lermen, Simon, et al.
Published: (2024)

Too Big to Fool: Resisting Deception in Language Models
by: Samsami, Mohammad Reza, et al.
Published: (2024)

CryptoFormalEval: Integrating LLMs and Formal Verification for Automated Cryptographic Protocol Vulnerability Detection
by: Curaba, Cristian, et al.
Published: (2024)

Identifying Cooperative Personalities in Multi-agent Contexts through Personality Steering with Representation Engineering
by: Ong, Kenneth J. K., et al.
Published: (2025)

Leveraging Large Language Models for Building Interpretable Rule-Based Data-to-Text Systems
by: Warczyński, Jędrzej, et al.
Published: (2025)

Language Model Re-rankers are Fooled by Lexical Similarities
by: Hagström, Lovisa, et al.
Published: (2025)

To Tell The Truth: Language of Deception and Language Models
by: Hazra, Sanchaita, et al.
Published: (2023)

Emergent Languages in Populations of Language Model Agents: From Token Efficiency to Oversight Evasion
by: Beltoft, Stine Lyngsø, et al.
Published: (2026)

LLM Agents Implement an NLG System from Scratch: Building Interpretable Rule-Based RDF-to-Text Generators
by: Lango, Mateusz, et al.
Published: (2025)

PHANTOM RECALL: When Familiar Puzzles Fool Smart Models
by: Mukhopadhyay, Souradeep, et al.
Published: (2025)

LieCraft: A Multi-Agent Framework for Evaluating Deceptive Capabilities in Language Models
by: Olson, Matthew Lyle, et al.
Published: (2026)

Stress-testing Machine Generated Text Detection: Shifting Language Models Writing Style to Fool Detectors
by: Pedrotti, Andrea, et al.
Published: (2025)

An Assessment of Model-On-Model Deception
by: Heitkoetter, Julius, et al.
Published: (2024)

Seamless Deception: Larger Language Models Are Better Knowledge Concealers
by: Ashok, Dhananjay, et al.
Published: (2026)

Deception Abilities Emerged in Large Language Models
by: Hagendorff, Thilo
Published: (2023)

Fooling the Textual Fooler via Randomizing Latent Representations
by: Hoang, Duy C., et al.
Published: (2023)

Automated Interpretability and Feature Discovery in Language Models with Agents
by: Marin-Llobet, Arnau, et al.
Published: (2026)

Unmasking the Shadows of AI: Investigating Deceptive Capabilities in Large Language Models
by: Guo, Linge
Published: (2024)

The Point of No Return: Counterfactual Localization of Deceptive Commitment in Language-Model Reasoning
by: Merrill, Scott, et al.
Published: (2026)

Compromising Honesty and Harmlessness in Language Models via Deception Attacks
by: Vaugrante, Laurène, et al.
Published: (2025)

Language Models Optimized to Fool Detectors Still Have a Distinct Style (And How to Change It)
by: Soto, Rafael Rivera, et al.
Published: (2025)

From Deception to Detection: The Dual Roles of Large Language Models in Fake News
by: Sallami, Dorsaf, et al.
Published: (2024)

Can AI Models be Jailbroken to Phish Elderly Victims? An End-to-End Evaluation
by: Heiding, Fred, et al.
Published: (2025)

Great Models Think Alike and this Undermines AI Oversight
by: Goel, Shashwat, et al.
Published: (2025)

Unmasking Deceptive Visuals: Benchmarking Multimodal Large Language Models on Misleading Chart Question Answering
by: Chen, Zixin, et al.
Published: (2025)

Beyond Lines and Circles: Unveiling the Geometric Reasoning Gap in Large Language Models
by: Mouselinos, Spyridon, et al.
Published: (2024)

On the Importance and Evaluation of Narrativity in Natural Language AI Explanations
by: Cedro, Mateusz, et al.
Published: (2026)

Exploitation Without Deception: Dark Triad Feature Steering Reveals Separable Antisocial Circuits in Language Models
by: Berg, Cameron, et al.
Published: (2026)

OpenDeception: Learning Deception and Trust in Human-AI Interaction via Multi-Agent Simulation
by: Wu, Yichen, et al.
Published: (2025)

Evaluating & Reducing Deceptive Dialogue From Language Models with Multi-turn RL
by: Abdulhai, Marwa, et al.
Published: (2025)

Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant
by: Järviniemi, Olli, et al.
Published: (2024)

SequentialBreak: Large Language Models Can be Fooled by Embedding Jailbreak Prompts into Sequential Prompt Chains
by: Saiem, Bijoy Ahmed, et al.
Published: (2024)

AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents
by: Andriushchenko, Maksym, et al.
Published: (2024)

Towards Scalable Oversight via Partitioned Human Supervision
by: Yin, Ren, et al.
Published: (2025)

Dagger Behind Smile: Fool LLMs with a Happy Ending Story
by: Song, Xurui, et al.
Published: (2025)

Building a Precise Video Language with Human-AI Oversight
by: Lin, Zhiqiu, et al.
Published: (2026)

Preservation of Language Understanding Capabilities in Speech-aware Large Language Models
by: Kubis, Marek, et al.
Published: (2025)

The Steganographic Potentials of Language Models
by: Karpov, Artem, et al.
Published: (2025)

Click it or Leave it: Detecting and Spoiling Clickbait with Informativeness Measures and Large Language Models
by: Michaluk, Wojciech, et al.
Published: (2026)

NYT-Connections: A Deceptively Simple Text Classification Task that Stumps System-1 Thinkers
by: Lopez, Angel Yahir Loredo, et al.
Published: (2024)