Library Catalog

Bibliographic Details
Main Authors:	Tan, Daniel; Chanin, David; Lynch, Aengus; Kanoulas, Dimitrios; Paige, Brooks; Garriga-Alonso, Adria; Kirk, Robert
Format:	Preprint
Published:	2024
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2407.12404

Similar Items

SynthSAEBench: Evaluating Sparse Autoencoders on Scalable Realistic Synthetic Data
by: Chanin, David, et al.
Published: (2026)

Sparse but Wrong: Incorrect L0 Leads to Incorrect Features in Sparse Autoencoders
by: Chanin, David, et al.
Published: (2025)

Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders
by: Chanin, David, et al.
Published: (2025)

Biases in the Blind Spot: Detecting What LLMs Fail to Mention
by: Arcuschin, Iván, et al.
Published: (2026)

The Persistent Vulnerability of Aligned AI Systems
by: Lynch, Aengus
Published: (2026)

Are Sparse Autoencoder Benchmarks Reliable?
by: Chanin, David
Published: (2026)

Investigating the Indirect Object Identification circuit in Mamba
by: Ensign, Danielle, et al.
Published: (2024)

Among Us: A Sandbox for Measuring and Detecting Agentic Deception
by: Golechha, Satvik, et al.
Published: (2025)

Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification
by: Kwa, Thomas, et al.
Published: (2024)

Adversarial Circuit Evaluation
by: de Bos, Niels uit, et al.
Published: (2024)

InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques
by: Gupta, Rohan, et al.
Published: (2024)

Path Channels and Plan Extension Kernels: a Mechanistic Description of Planning in a Sokoban RNN
by: Taufeeque, Mohammad, et al.
Published: (2025)

Time-varying Factor Augmented Vector Autoregression with Grouped Sparse Autoencoder
by: Luo, Yiyong, et al.
Published: (2025)

Interpreting Emergent Planning in Model-Free Reinforcement Learning
by: Bush, Thomas, et al.
Published: (2025)

Understanding (Un)Reliability of Steering Vectors in Language Models
by: Braun, Joschka, et al.
Published: (2025)

DiFR: Inference Verification Despite Nondeterminism
by: Karvonen, Adam, et al.
Published: (2025)

How Do Large Language Monkeys Get Their Power (Laws)?
by: Schaeffer, Rylan, et al.
Published: (2025)

Causal Machine Learning: A Survey and Open Problems
by: Kaddour, Jean, et al.
Published: (2022)

Planning in a recurrent neural network that plays Sokoban
by: Taufeeque, Mohammad, et al.
Published: (2024)

Decoding Answers Before Chain-of-Thought: Evidence from Pre-CoT Probes and Activation Steering
by: Cox, Kyle, et al.
Published: (2026)

Can a Confident Prior Replace a Cold Posterior?
by: Marek, Martin, et al.
Published: (2024)

SmilesT5: Domain-specific pretraining for molecular language models
by: Spence, Philip, et al.
Published: (2025)

Hypothesis Testing the Circuit Hypothesis in LLMs
by: Shi, Claudia, et al.
Published: (2024)

Moment Matching Denoising Gibbs Sampling
by: Zhang, Mingtian, et al.
Published: (2023)

AD4RL: Autonomous Driving Benchmarks for Offline Reinforcement Learning with Value-based Dataset
by: Lee, Dongsu, et al.
Published: (2024)

Effects of Dropout on Performance in Long-range Graph Learning Tasks
by: Singh, Jasraj, et al.
Published: (2025)

White-Box Sensitivity Auditing with Steering Vectors
by: Cyberey, Hannah, et al.
Published: (2026)

Unreal Robotics Lab: A High-Fidelity Robotics Simulator with Advanced Physics and Rendering
by: Embley-Riches, Jonathan, et al.
Published: (2025)

A study of EHVI vs fixed scalarization for molecule design
by: Yong, Anabel, et al.
Published: (2025)

Gaussian Processes on Cellular Complexes
by: Alain, Mathieu, et al.
Published: (2023)

Predicting Where Steering Vectors Succeed
by: Billa, Jayadev
Published: (2026)

Towards Steering without Sacrifice: Principled Training of Steering Vectors for Prompt-only Interventions
by: Bao, Yuntai, et al.
Published: (2026)

Agentic Misalignment: How LLMs Could Be Insider Threats
by: Lynch, Aengus, et al.
Published: (2025)

Beyond Steering Vector: Flow-based Activation Steering for Inference-Time Intervention
by: Jin, Zehao, et al.
Published: (2026)

The Energy Cost of Reasoning: Analyzing Energy Usage in LLMs with Test-time Compute
by: Jin, Yunho, et al.
Published: (2025)

AsEP: Benchmarking Deep Learning Methods for Antibody-specific Epitope Prediction
by: Liu, Chunan, et al.
Published: (2024)

Diffusive Gibbs Sampling
by: Chen, Wenlin, et al.
Published: (2024)

Dialz: A Python Toolkit for Steering Vectors
by: Siddique, Zara, et al.
Published: (2025)

Towards Healing the Blindness of Score Matching
by: Zhang, Mingtian, et al.
Published: (2022)

Learning to Recover: Dynamic Reward Shaping with Wheel-Leg Coordination for Fallen Robots
by: Deng, Boyuan, et al.
Published: (2025)