:: Library Catalog

Bibliographic Details
Main Authors:	Wen, Jiaxin; Hebbar, Vivek; Larson, Caleb; Bhatt, Aryan; Radhakrishnan, Ansh; Sharma, Mrinank; Sleight, Henry; Feng, Shi; He, He; Perez, Ethan; Shlegeris, Buck; Khan, Akbir
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2411.17693

Similar Items

Auditing Sabotage Bench: A Benchmark for Detecting and Fixing Research Sabotage in ML Codebases
by: Gan, Eric, et al.
Published: (2026)

Evaluating Control Protocols for Untrusted AI Agents
by: Kutasov, Jon, et al.
Published: (2025)

Ctrl-Z: Controlling AI Agents via Resampling
by: Bhatt, Aryan, et al.
Published: (2025)

Rapid Response: Mitigating LLM Jailbreaks with a Few Examples
by: Peng, Alwin, et al.
Published: (2024)

Towards Safeguarding LLM Fine-tuning APIs against Cipher Attacks
by: Youstra, Jack, et al.
Published: (2025)

Debating with More Persuasive LLMs Leads to More Truthful Answers
by: Khan, Akbir, et al.
Published: (2024)

Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols
by: Griffin, Charlie, et al.
Published: (2024)

Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs
by: Sheshadri, Abhay, et al.
Published: (2024)

Best-of-N Jailbreaking
by: Hughes, John, et al.
Published: (2024)

Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach
by: Wang, Tony T., et al.
Published: (2024)

Agentic Misalignment: How LLMs Could Be Insider Threats
by: Lynch, Aengus, et al.
Published: (2025)

Language Models Learn to Mislead Humans via RLHF
by: Wen, Jiaxin, et al.
Published: (2024)

AI Control: Improving Safety Despite Intentional Subversion
by: Greenblatt, Ryan, et al.
Published: (2023)

How to evaluate control measures for LLM agents? A trajectory from today to superintelligence
by: Korbak, Tomek, et al.
Published: (2025)

Removing Sandbagging in LLMs by Training with Weak Supervision
by: Ryd, Emil, et al.
Published: (2026)

Language models are better than humans at next-token prediction
by: Shlegeris, Buck, et al.
Published: (2022)

Mock Theta Functions as Optimal Stopping Criteria for Photonic Quantum Entropy Computation
by: Sharma, Ansh, et al.
Published: (2026)

Factorio Learning Environment
by: Hopkins, Jack, et al.
Published: (2025)

Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols?
by: Mallen, Alex, et al.
Published: (2024)

Polysemanticity and Capacity in Neural Networks
by: Scherlis, Adam, et al.
Published: (2022)

A sketch of an AI control safety case
by: Korbak, Tomek, et al.
Published: (2025)

SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents
by: Kutasov, Jonathan, et al.
Published: (2025)

Failures to Find Transferable Image Jailbreaks Between Vision-Language Models
by: Schaeffer, Rylan, et al.
Published: (2024)

Characterizing Paraphrase-Induced Failures in Lean 4 Autoformalization
by: Feng, William, et al.
Published: (2026)

The Duty of Knowing Oneself as One Appears: A Response to Kant’s Problem of Moral Self-Knowledge
by: Radhakrishnan, Vivek Kumar
Published: (2019)

The Hot Mess of AI: How Does Misalignment Scale With Model Intelligence and Task Complexity?
by: Hägele, Alexander, et al.
Published: (2026)

Programming by Backprop: An Instruction is Worth 100 Examples When Finetuning LLMs
by: Cook, Jonathan, et al.
Published: (2025)

Unsupervised Elicitation of Language Models
by: Wen, Jiaxin, et al.
Published: (2025)

All Code, No Thought: Current Language Models Struggle to Reason in Ciphered Language
by: Guo, Shiyuan, et al.
Published: (2025)

The LLM Has Left The Chat: Evidence of Bail Preferences in Large Language Models
by: Ensign, Danielle, et al.
Published: (2025)

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
by: Hubinger, Evan, et al.
Published: (2024)

LLMs as Debate Partners: Utilizing Genetic Algorithms and Adversarial Search for Adaptive Arguments
by: Aryan, Prakash
Published: (2024)

Beware Untrusted Simulators -- Reward-Free Backdoor Attacks in Reinforcement Learning
by: Rathbun, Ethan, et al.
Published: (2026)

Alignment faking in large language models
by: Greenblatt, Ryan, et al.
Published: (2024)

Who's in Charge? Disempowerment Patterns in Real-World LLM Usage
by: Sharma, Mrinank, et al.
Published: (2026)

Incorporating Unlabelled Data into Bayesian Neural Networks
by: Sharma, Mrinank, et al.
Published: (2023)

PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts
by: Li, Qinfeng, et al.
Published: (2026)

$κ$-solutions with the round cylinder as an asymptotic shrinker
by: Hebbar, Aprameya Girish
Published: (2026)

Hybrid Implementation for Untrusted-node-based Quantum Key Distribution Network
by: Liu, Jingyang, et al.
Published: (2025)

Believe It or Not: How Deeply do LLMs Believe Implanted Facts?
by: Slocum, Stewart, et al.
Published: (2025)