:: Library Catalog

Bibliographic Details
Main Authors:	Turtayev, Rustem; Fedorova, Natalia; Serikov, Oleg; Koldyba, Sergey; Avagyan, Lev; Volkov, Dmitrii
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2510.19738

Similar Items

Hacking CTFs with Plain Agents
by: Turtayev, Rustem, et al.
Published: (2024)

LLM Agent Honeypot: Monitoring AI Hacking Agents in the Wild
by: Reworr, et al.
Published: (2024)

Evaluating AI cyber capabilities with crowdsourced elicitation
by: Petrov, Artem, et al.
Published: (2025)

Hodoscope: Unsupervised Monitoring for AI Misbehaviors
by: Zhong, Ziqian, et al.
Published: (2026)

Training Agents to Self-Report Misbehavior
by: Lee, Bruce W., et al.
Published: (2026)

Badllama 3: removing safety finetuning from Llama 3 in minutes
by: Volkov, Dmitrii
Published: (2024)

Wink: Recovering from Misbehaviors in Coding Agents
by: Nanda, Rahul, et al.
Published: (2026)

How to Correctly do Semantic Backpropagation on Language-based Agentic Systems
by: Wang, Wenyi, et al.
Published: (2024)

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
by: Baker, Bowen, et al.
Published: (2025)

Demonstrating specification gaming in reasoning models
by: Bondarenko, Alexander, et al.
Published: (2025)

Agent Hunt: Bounty Based Collaborative Autoformalization With LLM Agents
by: Brown, Chad E., et al.
Published: (2026)

LLMScan: Causal Scan for LLM Misbehavior Detection
by: Zhang, Mengdi, et al.
Published: (2024)

Multi-layer random features and the approximation power of neural networks
by: Takhanov, Rustem
Published: (2024)

The Current State of AI Bias Bounties: An Overview of Existing Programmes and Research
by: Kucenko, Sergej, et al.
Published: (2025)

BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems
by: Zhang, Andy K., et al.
Published: (2025)

TeachAnything: A Multimodal Crowdsourcing Platform for Training Embodied AI Agents in Symmetrical Reality
by: Liu, Zidong, et al.
Published: (2026)

From Single Agent to Multi-Agent: Improving Traffic Signal Control
by: Tislenko, Maksim, et al.
Published: (2024)

Convergence dynamics of Agent-to-Agent Interactions with Misaligned objectives
by: Cosentino, Romain, et al.
Published: (2025)

Layerwise Convergence Fingerprints for Runtime Misbehavior Detection in Large Language Models
by: Min, Nay Myat, et al.
Published: (2026)

A Novel Labeled Human Voice Signal Dataset for Misbehavior Detection
by: Raza, Ali, et al.
Published: (2024)

Conditional KRR: Injecting Unpenalized Features into Kernel Methods with Applications to Kernel Thresholding
by: Takhanov, Rustem, et al.
Published: (2026)

Towards Robust Speech Deepfake Detection via Human-Inspired Reasoning
by: Dvirniak, Artem, et al.
Published: (2026)

AttentionGuard: Transformer-based Misbehavior Detection for Secure Vehicular Platoons
by: Li, Hexu, et al.
Published: (2025)

The Coming Crisis of Multi-Agent Misalignment: AI Alignment Must Be a Dynamic and Social Process
by: Carichon, Florian, et al.
Published: (2025)

Preemptive Detection and Correction of Misaligned Actions in LLM Agents
by: Fang, Haishuo, et al.
Published: (2024)

Assessing the Potential of Generative Agents in Crowdsourced Fact-Checking
by: Costabile, Luigia, et al.
Published: (2025)

Attention in Motion: Secure Platooning via Transformer-based Misbehavior Detection
by: Kalogiannis, Konstantinos, et al.
Published: (2025)

Emergent Misalignment is Easy, Narrow Misalignment is Hard
by: Soligo, Anna, et al.
Published: (2026)

Semantic Laundering in AI Agent Architectures: Why Tool Boundaries Do Not Confer Epistemic Warrant
by: Romanchuk, Oleg, et al.
Published: (2026)

Probabilistic Verification of Voice Anti-Spoofing Models
by: Kushnir, Evgeny, et al.
Published: (2026)

Experimental Narratives: A Comparison of Human Crowdsourced Storytelling and AI Storytelling
by: Begus, Nina
Published: (2023)

LLMs Deceive Unintentionally: Emergent Misalignment in Dishonesty from Misaligned Samples to Biased Human-AI Interactions
by: Hu, Xuhao, et al.
Published: (2025)

Human Attribution of Causality to AI Across Agency, Misuse, and Misalignment
by: Carro, Maria Victoria, et al.
Published: (2026)

Overtrained, Not Misaligned
by: Schreiber, Joel, et al.
Published: (2026)

Human-in-the-Loop and AI: Crowdsourcing Metadata Vocabulary for Materials Science
by: Greenberg, Jane, et al.
Published: (2025)

ROGUE: Misaligned Agent Behavior Arising from Ordinary Computer Use
by: Tien, Jeremy, et al.
Published: (2026)

The Hot Mess of AI: How Does Misalignment Scale With Model Intelligence and Task Complexity?
by: Hägele, Alexander, et al.
Published: (2026)

Thought Virus: Viral Misalignment via Subliminal Prompting in Multi-Agent Systems
by: Weckbecker, Moritz, et al.
Published: (2026)

Password-Activated Shutdown Protocols for Misaligned Frontier Agents
by: Williams, Kai, et al.
Published: (2025)

AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents
by: Naik, Akshat, et al.
Published: (2025)