| Main Authors: | Sandoval, Aaron; Rushing, Cody |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2512.02157 |
Similar Items
Factor(U,T): Controlling Untrusted AI by Monitoring their Plans
by: Lip, Edward Lue Chee, et al.
Published: (2025)
Trust No Tool: Evaluating and Defending LLM Agents under Untrusted Tool Feedback
by: Yan, Lecheng, et al.
Published: (2026)
Basic Legibility Protocols Improve Trusted Monitoring
by: Sreevatsa, Ashwin, et al.
Published: (2026)
Enhancing Security and Strengthening Defenses in Automated Short-Answer Grading Systems
by: Yarmohammadtoosky, Sahar, et al.
Published: (2025)
BashArena: A Control Setting for Highly Privileged AI Agents
by: Kaufman, Adam, et al.
Published: (2025)
Subversion via Focal Points: Investigating Collusion in LLM Monitoring
by: Järviniemi, Olli
Published: (2025)
Monitoring the Internal Monologue: Probe Trajectories Reveal Reasoning Dynamics
by: Chrabąszcz, Maciej, et al.
Published: (2026)
ShadowCoT: Cognitive Hijacking for Stealthy Reasoning Backdoors in LLMs
by: Zhao, Gejian, et al.
Published: (2025)
Towards Proactive Defense Against Cyber Cognitive Attacks
by: Rushing, Bonnie, et al.
Published: (2025)
Cognitive Control Architecture (CCA): A Lifecycle Supervision Framework for Robustly Aligned AI Agents
by: Liang, Zhibo, et al.
Published: (2025)
Private Memorization Editing: Turning Memorization into a Defense to Strengthen Data Privacy in Large Language Models
by: Ruzzetti, Elena Sofia, et al.
Published: (2025)
MCPShield: A Security Cognition Layer for Adaptive Trust Calibration in Model Context Protocol Agents
by: Zhou, Zhenhong, et al.
Published: (2026)
Proof-of-Guardrail in AI Agents and What (Not) to Trust from It
by: Jin, Xisen, et al.
Published: (2026)
Triad: Trusted Timestamps in Untrusted Environments
by: Fernandez, Gabriel P., et al.
Published: (2023)
ExtremeAIGC: Benchmarking LMM Vulnerability to AI-Generated Extremist Content
by: Chandna, Bhavik, et al.
Published: (2025)
Enforcing Attestable Workflows across Untrusted Networks
by: Dang, Hung, et al.
Published: (2026)
GradEscape: A Gradient-Based Evader Against AI-Generated Text Detectors
by: Meng, Wenlong, et al.
Published: (2025)
Towards Safe AI Clinicians: A Comprehensive Study on Large Language Model Jailbreaking in Healthcare
by: Zhang, Hang, et al.
Published: (2025)
Watermarking Conditional Text Generation for AI Detection: Unveiling Challenges and a Semantic-Aware Watermark Remedy
by: Fu, Yu, et al.
Published: (2023)
Towards Understanding the Cognitive Habits of Large Reasoning Models
by: Dong, Jianshuo, et al.
Published: (2025)
"Give a Positive Review Only": An Early Investigation Into In-Paper Prompt Injection Attacks and Defenses for AI Reviewers
by: Zhou, Qin, et al.
Published: (2025)
Private Aggregate Queries to Untrusted Databases
by: Hafiz, Syed Mahbub, et al.
Published: (2024)
LibVulnWatch: A Deep Assessment Agent System and Leaderboard for Uncovering Hidden Vulnerabilities in Open-Source AI Libraries
by: Wu, Zekun, et al.
Published: (2025)
AI Agents May Always Fall for Prompt Injections
by: Abdelnabi, Sahar, et al.
Published: (2026)
Institutional Platform for Secure Self-Service Large Language Model Exploration
by: Bumgardner, V. K. Cody, et al.
Published: (2024)
SMTFL: Secure Model Training to Untrusted Participants in Federated Learning
by: Zhao, Zhihui, et al.
Published: (2025)
Are We in the AI-Generated Text World Already? Quantifying and Monitoring AIGT on Social Media
by: Sun, Zhen, et al.
Published: (2024)
Cabin: Confining Untrusted Programs within Confidential VMs
by: Mei, Benshan, et al.
Published: (2024)
Covert Communication for Untrusted UAV-Assisted Wireless Systems
by: Gao, Chan, et al.
Published: (2024)
Towards Trustworthy Federated Learning with Untrusted Participants
by: Allouah, Youssef, et al.
Published: (2025)
In AI Sweet Harmony: Sociopragmatic Guardrail Bypasses and Evaluation-Awareness in OpenAI gpt-oss-20b
by: Durner, Nils
Published: (2025)
AVISE: Framework for Evaluating the Security of AI Systems
by: Lempinen, Mikko, et al.
Published: (2026)
LATTICE: Evaluating Decision Support Utility of Crypto Agents
by: Chan, Aaron, et al.
Published: (2026)
RedacBench: Can AI Erase Your Secrets?
by: Jeon, Hyunjun, et al.
Published: (2026)
Analysis and prevention of AI-based phishing email attacks
by: Eze, Chibuike Samuel, et al.
Published: (2024)
Leveraging ASIC AI Chips for Homomorphic Encryption
by: Tong, Jianming, et al.
Published: (2025)
Enabling Low-Cost Secure Computing on Untrusted In-Memory Architectures
by: Ghinani, Sahar Ghoflsaz, et al.
Published: (2025)
Pirates: Anonymous Group Calls Over Fully Untrusted Infrastructure
by: Coijanovic, Christoph, et al.
Published: (2024)
VelLMes: A high-interaction AI-based deception framework
by: Sladić, Muris, et al.
Published: (2025)
Modeling the Attack: Detecting AI-Generated Text by Quantifying Adversarial Perturbations
by: Teja, Lekkala Sai, et al.
Published: (2025)