| Main Authors: | Tan, Daniel; Chanin, David; Lynch, Aengus; Kanoulas, Dimitrios; Paige, Brooks; Garriga-Alonso, Adria; Kirk, Robert |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2407.12404 |
Similar Items
SynthSAEBench: Evaluating Sparse Autoencoders on Scalable Realistic Synthetic Data
by: Chanin, David, et al.
Published: (2026)
Sparse but Wrong: Incorrect L0 Leads to Incorrect Features in Sparse Autoencoders
by: Chanin, David, et al.
Published: (2025)
Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders
by: Chanin, David, et al.
Published: (2025)
Biases in the Blind Spot: Detecting What LLMs Fail to Mention
by: Arcuschin, Iván, et al.
Published: (2026)
The Persistent Vulnerability of Aligned AI Systems
by: Lynch, Aengus
Published: (2026)
Are Sparse Autoencoder Benchmarks Reliable?
by: Chanin, David
Published: (2026)
Investigating the Indirect Object Identification circuit in Mamba
by: Ensign, Danielle, et al.
Published: (2024)
Among Us: A Sandbox for Measuring and Detecting Agentic Deception
by: Golechha, Satvik, et al.
Published: (2025)
Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification
by: Kwa, Thomas, et al.
Published: (2024)
Adversarial Circuit Evaluation
by: de Bos, Niels uit, et al.
Published: (2024)
InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques
by: Gupta, Rohan, et al.
Published: (2024)
Path Channels and Plan Extension Kernels: a Mechanistic Description of Planning in a Sokoban RNN
by: Taufeeque, Mohammad, et al.
Published: (2025)
Time-varying Factor Augmented Vector Autoregression with Grouped Sparse Autoencoder
by: Luo, Yiyong, et al.
Published: (2025)
Interpreting Emergent Planning in Model-Free Reinforcement Learning
by: Bush, Thomas, et al.
Published: (2025)
Understanding (Un)Reliability of Steering Vectors in Language Models
by: Braun, Joschka, et al.
Published: (2025)
DiFR: Inference Verification Despite Nondeterminism
by: Karvonen, Adam, et al.
Published: (2025)
How Do Large Language Monkeys Get Their Power (Laws)?
by: Schaeffer, Rylan, et al.
Published: (2025)
Causal Machine Learning: A Survey and Open Problems
by: Kaddour, Jean, et al.
Published: (2022)
Planning in a recurrent neural network that plays Sokoban
by: Taufeeque, Mohammad, et al.
Published: (2024)
Decoding Answers Before Chain-of-Thought: Evidence from Pre-CoT Probes and Activation Steering
by: Cox, Kyle, et al.
Published: (2026)
Can a Confident Prior Replace a Cold Posterior?
by: Marek, Martin, et al.
Published: (2024)
SmilesT5: Domain-specific pretraining for molecular language models
by: Spence, Philip, et al.
Published: (2025)
Hypothesis Testing the Circuit Hypothesis in LLMs
by: Shi, Claudia, et al.
Published: (2024)
Moment Matching Denoising Gibbs Sampling
by: Zhang, Mingtian, et al.
Published: (2023)
AD4RL: Autonomous Driving Benchmarks for Offline Reinforcement Learning with Value-based Dataset
by: Lee, Dongsu, et al.
Published: (2024)
Effects of Dropout on Performance in Long-range Graph Learning Tasks
by: Singh, Jasraj, et al.
Published: (2025)
White-Box Sensitivity Auditing with Steering Vectors
by: Cyberey, Hannah, et al.
Published: (2026)
Unreal Robotics Lab: A High-Fidelity Robotics Simulator with Advanced Physics and Rendering
by: Embley-Riches, Jonathan, et al.
Published: (2025)
A study of EHVI vs fixed scalarization for molecule design
by: Yong, Anabel, et al.
Published: (2025)
Gaussian Processes on Cellular Complexes
by: Alain, Mathieu, et al.
Published: (2023)
Predicting Where Steering Vectors Succeed
by: Billa, Jayadev
Published: (2026)
Towards Steering without Sacrifice: Principled Training of Steering Vectors for Prompt-only Interventions
by: Bao, Yuntai, et al.
Published: (2026)
Agentic Misalignment: How LLMs Could Be Insider Threats
by: Lynch, Aengus, et al.
Published: (2025)
Beyond Steering Vector: Flow-based Activation Steering for Inference-Time Intervention
by: Jin, Zehao, et al.
Published: (2026)
The Energy Cost of Reasoning: Analyzing Energy Usage in LLMs with Test-time Compute
by: Jin, Yunho, et al.
Published: (2025)
AsEP: Benchmarking Deep Learning Methods for Antibody-specific Epitope Prediction
by: Liu, Chunan, et al.
Published: (2024)
Diffusive Gibbs Sampling
by: Chen, Wenlin, et al.
Published: (2024)
Dialz: A Python Toolkit for Steering Vectors
by: Siddique, Zara, et al.
Published: (2025)
Towards Healing the Blindness of Score Matching
by: Zhang, Mingtian, et al.
Published: (2022)
Learning to Recover: Dynamic Reward Shaping with Wheel-Leg Coordination for Fallen Robots
by: Deng, Boyuan, et al.
Published: (2025)