| Main Authors: | Dang, Jacob, Xie, Brian Y., Younis, Omar G. |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2604.15559 |
Similar Items
Subliminal Learning Is Steering Vector Distillation
by: Blank, Camila, et al.
Published: (2026)
Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer
by: Schrodi, Simon, et al.
Published: (2025)
Emergent and Subliminal Misalignment Through the Lens of Data-Mediated Transfer
by: Askin, Baris, et al.
Published: (2026)
ClawSafety: "Safe" LLMs, Unsafe Agents
by: Wei, Bowen, et al.
Published: (2026)
Enforcing Benign Trajectories: A Behavioral Firewall for Structured-Workflow AI Agents
by: Dang, Hung
Published: (2026)
Thought Virus: Viral Misalignment via Subliminal Prompting in Multi-Agent Systems
by: Weckbecker, Moritz, et al.
Published: (2026)
Subliminal Learning is a LoRA Artifact
by: Nief, Todd, et al.
Published: (2026)
Subliminal Learning: Language models transmit behavioral traits via hidden signals in data
by: Cloud, Alex, et al.
Published: (2025)
History Anchors: How Prior Behavior Steers LLM Decisions Toward Unsafe Actions
by: Salgado, Alberto G. Rodríguez
Published: (2026)
When Benign Inputs Lead to Severe Harms: Eliciting Unsafe Unintended Behaviors of Computer-Use Agents
by: Jones, Jaylen, et al.
Published: (2026)
Sustained Gradient Alignment Mediates Subliminal Learning in a Multi-Step Setting: Evidence from MNIST Auxiliary Logit Distillation Experiment
by: Kitkana, Chayanon, et al.
Published: (2026)
Improving Pre-Trained Vision-Language-Action Policies with Model-Based Search
by: Neary, Cyrus, et al.
Published: (2025)
Behavioral Determinants of Deployed AI Agents in Social Networks: A Multi-Factor Study of Personality, Model, and Guardrail Specification
by: Wilson, Sarah, et al.
Published: (2026)
Subliminal Effects in Your Data: A General Mechanism via Log-Linearity
by: Aden-Ali, Ishaq, et al.
Published: (2026)
Behavioral Transfer in AI Agents: Evidence and Privacy Implications
by: Luo, Shilei, et al.
Published: (2026)
Breach By A Thousand Leaks: Unsafe Information Leakage in `Safe' AI Responses
by: Glukhov, David, et al.
Published: (2024)
Advancing Model Refinement: Muon-Optimized Distillation and Quantization for LLM Deployment
by: Sander, Jacob, et al.
Published: (2026)
Learning Through Noise: Why Subliminal Learning Works and When It Fails
by: Brockers, Vincent C., et al.
Published: (2026)
Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills
by: Ni, Jingwei, et al.
Published: (2026)
ELISA: An Interpretable Hybrid Generative AI Agent for Expression-Grounded Discovery in Single-Cell Genomics
by: Coser, Omar
Published: (2026)
PANDO: Efficient Multimodal AI Agents via Online Skill Distillation
by: Li, Yubo, et al.
Published: (2026)
Unsafe LLM-Based Search: Quantitative Analysis and Mitigation of Safety Risks in AI Web Search
by: Luo, Zeren, et al.
Published: (2025)
Promoting Online Safety by Simulating Unsafe Conversations with LLMs
by: Hoffman, Owen, et al.
Published: (2025)
AgentDistill: Training-Free Agent Distillation with Generalizable MCP Boxes
by: Qiu, Jiahao, et al.
Published: (2025)
Offline Behavior Distillation
by: Lei, Shiye, et al.
Published: (2024)
Safe-Support Q-Learning: Learning without Unsafe Exploration
by: Lim, Yeeun, et al.
Published: (2026)
Bidirectional Distillation: A Mixed-Play Framework for Multi-Agent Generalizable Behaviors
by: Feng, Lang, et al.
Published: (2025)
Transferable XAI: Relating Understanding Across Domains with Explanation Transfer
by: Wang, Fei, et al.
Published: (2026)
Data-Free Generative Replay for Class-Incremental Learning on Imbalanced Data
by: Younis, Sohaib, et al.
Published: (2024)
Open, Reliable, and Collective: A Community-Driven Framework for Tool-Using AI Agents
by: Dang, Hy, et al.
Published: (2026)
The Agent Behavior: Model, Governance and Challenges in the AI Digital Age
by: Zhang, Qiang, et al.
Published: (2025)
Towards Understanding Unsafe Video Generation
by: Pang, Yan, et al.
Published: (2024)
Forage V2: Knowledge Evolution and Transfer in Autonomous Agent Organizations
by: Xie, Huaqing
Published: (2026)
AIBuildAI: An AI Agent for Automatically Building AI Models
by: Zhang, Ruiyi, et al.
Published: (2026)
Differentiable and Stable Long-Range Tracking of Multiple Posterior Modes
by: Younis, Ali, et al.
Published: (2024)
Predicting AI Agent Behavior through Approximation of the Perron-Frobenius Operator
by: Zhang, Shiqi, et al.
Published: (2024)
Safe Semantics, Unsafe Interpretations: Tackling Implicit Reasoning Safety in Large Vision-Language Models
by: Cai, Wei, et al.
Published: (2025)
Safe Vision-Language Models via Unsafe Weights Manipulation
by: D'Incà, Moreno, et al.
Published: (2025)
AI-Powered Database Management: Predictive Analytics for Performance Tuning
by: Shahwan, Younis Ali, et al.
Published: (2025)
Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories
by: Zhang, Dongcheng, et al.
Published: (2026)