| Main Authors: | Turtayev, Rustem, Fedorova, Natalia, Serikov, Oleg, Koldyba, Sergey, Avagyan, Lev, Volkov, Dmitrii |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2510.19738 |
Similar Items
Hacking CTFs with Plain Agents
by: Turtayev, Rustem, et al.
Published: (2024)
LLM Agent Honeypot: Monitoring AI Hacking Agents in the Wild
by: Reworr, et al.
Published: (2024)
Evaluating AI cyber capabilities with crowdsourced elicitation
by: Petrov, Artem, et al.
Published: (2025)
Hodoscope: Unsupervised Monitoring for AI Misbehaviors
by: Zhong, Ziqian, et al.
Published: (2026)
Training Agents to Self-Report Misbehavior
by: Lee, Bruce W., et al.
Published: (2026)
Badllama 3: removing safety finetuning from Llama 3 in minutes
by: Volkov, Dmitrii
Published: (2024)
Wink: Recovering from Misbehaviors in Coding Agents
by: Nanda, Rahul, et al.
Published: (2026)
How to Correctly do Semantic Backpropagation on Language-based Agentic Systems
by: Wang, Wenyi, et al.
Published: (2024)
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
by: Baker, Bowen, et al.
Published: (2025)
Demonstrating specification gaming in reasoning models
by: Bondarenko, Alexander, et al.
Published: (2025)
Agent Hunt: Bounty Based Collaborative Autoformalization With LLM Agents
by: Brown, Chad E., et al.
Published: (2026)
LLMScan: Causal Scan for LLM Misbehavior Detection
by: Zhang, Mengdi, et al.
Published: (2024)
Multi-layer random features and the approximation power of neural networks
by: Takhanov, Rustem
Published: (2024)
The Current State of AI Bias Bounties: An Overview of Existing Programmes and Research
by: Kucenko, Sergej, et al.
Published: (2025)
BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems
by: Zhang, Andy K., et al.
Published: (2025)
TeachAnything: A Multimodal Crowdsourcing Platform for Training Embodied AI Agents in Symmetrical Reality
by: Liu, Zidong, et al.
Published: (2026)
From Single Agent to Multi-Agent: Improving Traffic Signal Control
by: Tislenko, Maksim, et al.
Published: (2024)
Convergence dynamics of Agent-to-Agent Interactions with Misaligned objectives
by: Cosentino, Romain, et al.
Published: (2025)
Layerwise Convergence Fingerprints for Runtime Misbehavior Detection in Large Language Models
by: Min, Nay Myat, et al.
Published: (2026)
A Novel Labeled Human Voice Signal Dataset for Misbehavior Detection
by: Raza, Ali, et al.
Published: (2024)
Conditional KRR: Injecting Unpenalized Features into Kernel Methods with Applications to Kernel Thresholding
by: Takhanov, Rustem, et al.
Published: (2026)
Towards Robust Speech Deepfake Detection via Human-Inspired Reasoning
by: Dvirniak, Artem, et al.
Published: (2026)
AttentionGuard: Transformer-based Misbehavior Detection for Secure Vehicular Platoons
by: Li, Hexu, et al.
Published: (2025)
The Coming Crisis of Multi-Agent Misalignment: AI Alignment Must Be a Dynamic and Social Process
by: Carichon, Florian, et al.
Published: (2025)
Preemptive Detection and Correction of Misaligned Actions in LLM Agents
by: Fang, Haishuo, et al.
Published: (2024)
Assessing the Potential of Generative Agents in Crowdsourced Fact-Checking
by: Costabile, Luigia, et al.
Published: (2025)
Attention in Motion: Secure Platooning via Transformer-based Misbehavior Detection
by: Kalogiannis, Konstantinos, et al.
Published: (2025)
Emergent Misalignment is Easy, Narrow Misalignment is Hard
by: Soligo, Anna, et al.
Published: (2026)
Semantic Laundering in AI Agent Architectures: Why Tool Boundaries Do Not Confer Epistemic Warrant
by: Romanchuk, Oleg, et al.
Published: (2026)
Probabilistic Verification of Voice Anti-Spoofing Models
by: Kushnir, Evgeny, et al.
Published: (2026)
Experimental Narratives: A Comparison of Human Crowdsourced Storytelling and AI Storytelling
by: Begus, Nina
Published: (2023)
LLMs Deceive Unintentionally: Emergent Misalignment in Dishonesty from Misaligned Samples to Biased Human-AI Interactions
by: Hu, Xuhao, et al.
Published: (2025)
Human Attribution of Causality to AI Across Agency, Misuse, and Misalignment
by: Carro, Maria Victoria, et al.
Published: (2026)
Overtrained, Not Misaligned
by: Schreiber, Joel, et al.
Published: (2026)
Human-in-the-Loop and AI: Crowdsourcing Metadata Vocabulary for Materials Science
by: Greenberg, Jane, et al.
Published: (2025)
ROGUE: Misaligned Agent Behavior Arising from Ordinary Computer Use
by: Tien, Jeremy, et al.
Published: (2026)
The Hot Mess of AI: How Does Misalignment Scale With Model Intelligence and Task Complexity?
by: Hägele, Alexander, et al.
Published: (2026)
Thought Virus: Viral Misalignment via Subliminal Prompting in Multi-Agent Systems
by: Weckbecker, Moritz, et al.
Published: (2026)
Password-Activated Shutdown Protocols for Misaligned Frontier Agents
by: Williams, Kai, et al.
Published: (2025)
AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents
by: Naik, Akshat, et al.
Published: (2025)