Library Catalog

Bibliographic Details
Main Authors:	Karvonen, Adam; Reuter, Daniel; Rinberg, Roy; Marks, Luke; Garriga-Alonso, Adrià; Warr, Keri
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2511.20621

Similar Items

Verifying LLM Inference to Detect Model Weight Exfiltration
by: Rinberg, Roy, et al.
Published: (2025)

Haiku to Opus in Just 10 bits: LLMs Unlock Massive Compression Gains
by: Rinberg, Roy, et al.
Published: (2026)

Among Us: A Sandbox for Measuring and Detecting Agentic Deception
by: Golechha, Satvik, et al.
Published: (2025)

SynthSAEBench: Evaluating Sparse Autoencoders on Scalable Realistic Synthetic Data
by: Chanin, David, et al.
Published: (2026)

Robustly Improving LLM Fairness in Realistic Settings via Interpretability
by: Karvonen, Adam, et al.
Published: (2025)

Sparse but Wrong: Incorrect L0 Leads to Incorrect Features in Sparse Autoencoders
by: Chanin, David, et al.
Published: (2025)

Path Channels and Plan Extension Kernels: a Mechanistic Description of Planning in a Sokoban RNN
by: Taufeeque, Mohammad, et al.
Published: (2025)

Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders
by: Chanin, David, et al.
Published: (2025)

Biases in the Blind Spot: Detecting What LLMs Fail to Mention
by: Arcuschin, Iván, et al.
Published: (2026)

Interpreting Emergent Planning in Model-Free Reinforcement Learning
by: Bush, Thomas, et al.
Published: (2025)

Planning in a recurrent neural network that plays Sokoban
by: Taufeeque, Mohammad, et al.
Published: (2024)

Have it your way: Individualized Privacy Assignment for DP-SGD
by: Boenisch, Franziska, et al.
Published: (2023)

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning
by: Casademunt, Helena, et al.
Published: (2025)

Optimistic Verifiable Training by Controlling Hardware Nondeterminism
by: Srivastava, Megha, et al.
Published: (2024)

Hypothesis Testing the Circuit Hypothesis in LLMs
by: Shi, Claudia, et al.
Published: (2024)

Learning Multi-Level Features with Matryoshka Sparse Autoencoders
by: Bussmann, Bart, et al.
Published: (2025)

Data Science In Olfaction
by: Agarwal, Vivek, et al.
Published: (2024)

Investigating the Indirect Object Identification circuit in Mamba
by: Ensign, Danielle, et al.
Published: (2024)

Automatically Finding Rule-Based Neurons in OthelloGPT
by: Singh, Aditya, et al.
Published: (2025)

Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models
by: Karvonen, Adam, et al.
Published: (2024)

Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers
by: Karvonen, Adam, et al.
Published: (2025)

Decoding Answers Before Chain-of-Thought: Evidence from Pre-CoT Probes and Activation Steering
by: Cox, Kyle, et al.
Published: (2026)

Semi-Supervised Learning for Deep Causal Generative Models
by: Ibrahim, Yasin, et al.
Published: (2024)

ConDiSim: Conditional Diffusion Models for Simulation Based Inference
by: Nautiyal, Mayank, et al.
Published: (2025)

Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification
by: Kwa, Thomas, et al.
Published: (2024)

Adversarial Circuit Evaluation
by: de Bos, Niels uit, et al.
Published: (2024)

Specialised or Generic? Tokenization Choices for Radiology Language Models
by: Warr, Hermione, et al.
Published: (2025)

Negation Neglect: When models fail to learn negations in training
by: Mayne, Harry, et al.
Published: (2026)

Beyond Linear Steering: Unified Multi-Attribute Control for Language Models
by: Oozeer, Narmeen, et al.
Published: (2025)

Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification
by: Zhao, Eric, et al.
Published: (2025)

Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification
by: Liang, Zhenwen, et al.
Published: (2024)

TinySQL: A Progressive Text-to-SQL Dataset for Mechanistic Interpretability Research
by: Harrasse, Abir, et al.
Published: (2025)

Output Supervision Can Obfuscate the Chain of Thought
by: Drori, Jacob, et al.
Published: (2025)

When Planning Fails Despite Correct Execution: On Epistemic Calibration for LLM-Based Multi-Agent Systems
by: Wang, Zehao, et al.
Published: (2026)

Compute Aligned Training: Optimizing for Test Time Inference
by: Ousherovitch, Adam, et al.
Published: (2026)

InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques
by: Gupta, Rohan, et al.
Published: (2024)

Revisiting the (Sub)Optimality of Best-of-N for Inference-Time Alignment
by: Sriraman, Ved, et al.
Published: (2026)

ZKLoRA: Efficient Zero-Knowledge Proofs for LoRA Verification
by: Roy, Bidhan, et al.
Published: (2025)

ReDi: Rectified Discrete Flow
by: Yoo, Jaehoon, et al.
Published: (2025)

DiEC: Diffusion Embedded Clustering
by: Hu, Haidong, et al.
Published: (2025)