| Main Authors: | Wen, Jiaxin, Hebbar, Vivek, Larson, Caleb, Bhatt, Aryan, Radhakrishnan, Ansh, Sharma, Mrinank, Sleight, Henry, Feng, Shi, He, He, Perez, Ethan, Shlegeris, Buck, Khan, Akbir |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2411.17693 |
Similar Items
Auditing Sabotage Bench: A Benchmark for Detecting and Fixing Research Sabotage in ML Codebases
by: Gan, Eric, et al.
Published: (2026)
Evaluating Control Protocols for Untrusted AI Agents
by: Kutasov, Jon, et al.
Published: (2025)
Ctrl-Z: Controlling AI Agents via Resampling
by: Bhatt, Aryan, et al.
Published: (2025)
Rapid Response: Mitigating LLM Jailbreaks with a Few Examples
by: Peng, Alwin, et al.
Published: (2024)
Towards Safeguarding LLM Fine-tuning APIs against Cipher Attacks
by: Youstra, Jack, et al.
Published: (2025)
Debating with More Persuasive LLMs Leads to More Truthful Answers
by: Khan, Akbir, et al.
Published: (2024)
Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols
by: Griffin, Charlie, et al.
Published: (2024)
Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs
by: Sheshadri, Abhay, et al.
Published: (2024)
Best-of-N Jailbreaking
by: Hughes, John, et al.
Published: (2024)
Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach
by: Wang, Tony T., et al.
Published: (2024)
Agentic Misalignment: How LLMs Could Be Insider Threats
by: Lynch, Aengus, et al.
Published: (2025)
Language Models Learn to Mislead Humans via RLHF
by: Wen, Jiaxin, et al.
Published: (2024)
AI Control: Improving Safety Despite Intentional Subversion
by: Greenblatt, Ryan, et al.
Published: (2023)
How to evaluate control measures for LLM agents? A trajectory from today to superintelligence
by: Korbak, Tomek, et al.
Published: (2025)
Removing Sandbagging in LLMs by Training with Weak Supervision
by: Ryd, Emil, et al.
Published: (2026)
Language models are better than humans at next-token prediction
by: Shlegeris, Buck, et al.
Published: (2022)
Mock Theta Functions as Optimal Stopping Criteria for Photonic Quantum Entropy Computation
by: Ansh Sharma, Ansh, et al.
Published: (2026)
Factorio Learning Environment
by: Hopkins, Jack, et al.
Published: (2025)
Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols?
by: Mallen, Alex, et al.
Published: (2024)
Polysemanticity and Capacity in Neural Networks
by: Scherlis, Adam, et al.
Published: (2022)
A sketch of an AI control safety case
by: Korbak, Tomek, et al.
Published: (2025)
SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents
by: Kutasov, Jonathan, et al.
Published: (2025)
Failures to Find Transferable Image Jailbreaks Between Vision-Language Models
by: Schaeffer, Rylan, et al.
Published: (2024)
Characterizing Paraphrase-Induced Failures in Lean 4 Autoformalization
by: Feng, William, et al.
Published: (2026)
The Duty of Knowing Oneself as One Appears: A Response to Kant’s Problem of Moral Self-Knowledge
by: Vivek Kumar Radhakrishnan
Published: (2019)
The Hot Mess of AI: How Does Misalignment Scale With Model Intelligence and Task Complexity?
by: Hägele, Alexander, et al.
Published: (2026)
Programming by Backprop: An Instruction is Worth 100 Examples When Finetuning LLMs
by: Cook, Jonathan, et al.
Published: (2025)
Unsupervised Elicitation of Language Models
by: Wen, Jiaxin, et al.
Published: (2025)
All Code, No Thought: Current Language Models Struggle to Reason in Ciphered Language
by: Guo, Shiyuan, et al.
Published: (2025)
The LLM Has Left The Chat: Evidence of Bail Preferences in Large Language Models
by: Ensign, Danielle, et al.
Published: (2025)
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
by: Hubinger, Evan, et al.
Published: (2024)
LLMs as Debate Partners: Utilizing Genetic Algorithms and Adversarial Search for Adaptive Arguments
by: Aryan, Prakash
Published: (2024)
Beware Untrusted Simulators -- Reward-Free Backdoor Attacks in Reinforcement Learning
by: Rathbun, Ethan, et al.
Published: (2026)
Alignment faking in large language models
by: Greenblatt, Ryan, et al.
Published: (2024)
Who's in Charge? Disempowerment Patterns in Real-World LLM Usage
by: Sharma, Mrinank, et al.
Published: (2026)
Incorporating Unlabelled Data into Bayesian Neural Networks
by: Sharma, Mrinank, et al.
Published: (2023)
PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts
by: Li, Qinfeng, et al.
Published: (2026)
$κ$-solutions with the round cylinder as an asymptotic shrinker
by: Hebbar, Aprameya Girish
Published: (2026)
Hybrid Implementation for Untrusted-node-based Quantum Key Distribution Network
by: Liu, Jingyang, et al.
Published: (2025)
Believe It or Not: How Deeply do LLMs Believe Implanted Facts?
by: Slocum, Stewart, et al.
Published: (2025)