| Main Authors: | Stix, Charlotte, Pistillo, Matteo, Sastry, Girish, Hobbhahn, Marius, Ortega, Alejandro, Balesni, Mikita, Hallensleben, Annika, Goldowsky-Dill, Nix, Sharkey, Lee |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2504.12170 |
Similar Items
The Loss of Control Playbook: Degrees, Dynamics, and Preparedness
by: Stix, Charlotte, et al.
Published: (2025)
Large Language Models can Strategically Deceive their Users when Put Under Pressure
by: Scheurer, Jérémy, et al.
Published: (2023)
Pre-Deployment Information Sharing: A Zoning Taxonomy for Precursory Capabilities
by: Pistillo, Matteo, et al.
Published: (2024)
Assurance of Frontier AI Built for National Security
by: Pistillo, Matteo, et al.
Published: (2025)
Detecting Strategic Deception Using Linear Probes
by: Goldowsky-Dill, Nicholas, et al.
Published: (2025)
Towards evaluations-based safety cases for AI scheming
by: Balesni, Mikita, et al.
Published: (2024)
Internal Deployment in the AI Act
by: Pistillo, Matteo
Published: (2025)
Frontier Models are Capable of In-context Scheming
by: Meinke, Alexander, et al.
Published: (2024)
Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning
by: Braun, Dan, et al.
Published: (2024)
Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs
by: Laine, Rudolf, et al.
Published: (2024)
Using Degeneracy in the Loss Landscape for Mechanistic Interpretability
by: Bushnaq, Lucius, et al.
Published: (2024)
Lessons from Studying Two-Hop Latent Reasoning
by: Balesni, Mikita, et al.
Published: (2024)
Stress Testing Deliberative Alignment for Anti-Scheming Training
by: Schoen, Bronson, et al.
Published: (2025)
How to evaluate control measures for LLM agents? A trajectory from today to superintelligence
by: Korbak, Tomek, et al.
Published: (2025)
The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks
by: Bushnaq, Lucius, et al.
Published: (2024)
Towards Frontier Safety Policies Plus
by: Pistillo, Matteo
Published: (2025)
Children in Police Custody: Adversity and Adversariality Behind Closed Doors
by: Frances Sheahan
Published: (2025)
Behind Office Doors
Published: (2026)
Behind Office Doors
Published: (2026)
Honesty to Subterfuge: In-Context Reinforcement Learning Can Make Honest Models Reward Hack
by: McKee-Reid, Leo, et al.
Published: (2024)
Peeking Behind Closed Doors: Risks of LLM Evaluation by Private Data Curators
by: Bansal, Hritik, et al.
Published: (2025)
Defending Compute Thresholds Against Legal Loopholes
by: Pistillo, Matteo, et al.
Published: (2025)
The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"
by: Berglund, Lukas, et al.
Published: (2023)
Chapter 6 The Role of Corporate Governance in Macro-Prudential Regulation of Systemic Risk
by: Dill, Alexander
Published: (2020)
Owning the Stuff of Life
by: Stix, Gary
Backchaining Loss of Control Mitigations from Mission-Specific Benchmarks in National Security
by: Pistillo, Matteo, et al.
Published: (2026)
Technical Report: Evaluating Goal Drift in Language Model Agents
by: Arike, Rauno, et al.
Published: (2025)
Forecasting Frontier Language Model Agent Capabilities
by: Pimpale, Govind, et al.
Published: (2025)
Ground states for the Hartree energy functional in the critical case
by: Pistillo, Tommaso
Published: (2025)
Analogical Reasoning Within a Conceptual Hyperspace
by: Goldowsky, Howard, et al.
Published: (2024)
Behind Closed Doors: An Exploratory Study of the Perceptions of Librarians and the Hidden Intellectual Work of Collection Development in Canadian Public Libraries.
by: Nilsen, Kirsti, et al.
Published: (2002)
The Day the Library Closed Its Doors
by: Yates, Elizabeth
Published: (1970)
A Study of the Bookmobile Service of the Madison Public Library.
by: Nix, Larry T.
Published: (1981)
Bibliophilately Revisited.
by: Nix, Larry T.
Published: (2000)
Large Language Models Often Know When They Are Being Evaluated
by: Needham, Joe, et al.
Published: (2025)
Analyzing Probabilistic Methods for Evaluating Agent Capabilities
by: Højmark, Axel, et al.
Published: (2024)
The étale topos reconstructs varieties over sub-p-adic fields
by: Carlson, Magnus, et al.
Published: (2024)
Monographs in Microform: Issues in Cataloging and Bibliographic Control.
by: Mikita, Elizabeth G.
Published: (1981)
Hunter Midtown Library: The Closing of an Open Door
by: Foster, Barbara
Published: (1976)
DoorBot: Closed-Loop Task Planning and Manipulation for Door Opening in the Wild with Haptic Feedback
by: Wang, Zhi, et al.
Published: (2025)