| Main Authors: | Karvonen, Adam; Reuter, Daniel; Rinberg, Roy; Marks, Luke; Garriga-Alonso, Adrià; Warr, Keri |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2511.20621 |
Similar Items
Verifying LLM Inference to Detect Model Weight Exfiltration
by: Rinberg, Roy, et al.
Published: (2025)
Haiku to Opus in Just 10 bits: LLMs Unlock Massive Compression Gains
by: Rinberg, Roy, et al.
Published: (2026)
Among Us: A Sandbox for Measuring and Detecting Agentic Deception
by: Golechha, Satvik, et al.
Published: (2025)
SynthSAEBench: Evaluating Sparse Autoencoders on Scalable Realistic Synthetic Data
by: Chanin, David, et al.
Published: (2026)
Robustly Improving LLM Fairness in Realistic Settings via Interpretability
by: Karvonen, Adam, et al.
Published: (2025)
Sparse but Wrong: Incorrect L0 Leads to Incorrect Features in Sparse Autoencoders
by: Chanin, David, et al.
Published: (2025)
Path Channels and Plan Extension Kernels: a Mechanistic Description of Planning in a Sokoban RNN
by: Taufeeque, Mohammad, et al.
Published: (2025)
Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders
by: Chanin, David, et al.
Published: (2025)
Biases in the Blind Spot: Detecting What LLMs Fail to Mention
by: Arcuschin, Iván, et al.
Published: (2026)
Interpreting Emergent Planning in Model-Free Reinforcement Learning
by: Bush, Thomas, et al.
Published: (2025)
Planning in a recurrent neural network that plays Sokoban
by: Taufeeque, Mohammad, et al.
Published: (2024)
Have it your way: Individualized Privacy Assignment for DP-SGD
by: Boenisch, Franziska, et al.
Published: (2023)
Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning
by: Casademunt, Helena, et al.
Published: (2025)
Optimistic Verifiable Training by Controlling Hardware Nondeterminism
by: Srivastava, Megha, et al.
Published: (2024)
Hypothesis Testing the Circuit Hypothesis in LLMs
by: Shi, Claudia, et al.
Published: (2024)
Learning Multi-Level Features with Matryoshka Sparse Autoencoders
by: Bussmann, Bart, et al.
Published: (2025)
Data Science In Olfaction
by: Agarwal, Vivek, et al.
Published: (2024)
Investigating the Indirect Object Identification circuit in Mamba
by: Ensign, Danielle, et al.
Published: (2024)
Automatically Finding Rule-Based Neurons in OthelloGPT
by: Singh, Aditya, et al.
Published: (2025)
Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models
by: Karvonen, Adam, et al.
Published: (2024)
Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers
by: Karvonen, Adam, et al.
Published: (2025)
Decoding Answers Before Chain-of-Thought: Evidence from Pre-CoT Probes and Activation Steering
by: Cox, Kyle, et al.
Published: (2026)
Semi-Supervised Learning for Deep Causal Generative Models
by: Ibrahim, Yasin, et al.
Published: (2024)
ConDiSim: Conditional Diffusion Models for Simulation Based Inference
by: Nautiyal, Mayank, et al.
Published: (2025)
Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification
by: Kwa, Thomas, et al.
Published: (2024)
Adversarial Circuit Evaluation
by: de Bos, Niels uit, et al.
Published: (2024)
Specialised or Generic? Tokenization Choices for Radiology Language Models
by: Warr, Hermione, et al.
Published: (2025)
Negation Neglect: When models fail to learn negations in training
by: Mayne, Harry, et al.
Published: (2026)
Beyond Linear Steering: Unified Multi-Attribute Control for Language Models
by: Oozeer, Narmeen, et al.
Published: (2025)
Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification
by: Zhao, Eric, et al.
Published: (2025)
Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification
by: Liang, Zhenwen, et al.
Published: (2024)
TinySQL: A Progressive Text-to-SQL Dataset for Mechanistic Interpretability Research
by: Harrasse, Abir, et al.
Published: (2025)
Output Supervision Can Obfuscate the Chain of Thought
by: Drori, Jacob, et al.
Published: (2025)
When Planning Fails Despite Correct Execution: On Epistemic Calibration for LLM-Based Multi-Agent Systems
by: Wang, Zehao, et al.
Published: (2026)
Compute Aligned Training: Optimizing for Test Time Inference
by: Ousherovitch, Adam, et al.
Published: (2026)
InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques
by: Gupta, Rohan, et al.
Published: (2024)
Revisiting the (Sub)Optimality of Best-of-N for Inference-Time Alignment
by: Sriraman, Ved, et al.
Published: (2026)
ZKLoRA: Efficient Zero-Knowledge Proofs for LoRA Verification
by: Roy, Bidhan, et al.
Published: (2025)
ReDi: Rectified Discrete Flow
by: Yoo, Jaehoon, et al.
Published: (2025)
DiEC: Diffusion Embedded Clustering
by: Hu, Haidong, et al.
Published: (2025)