| Main Authors: | Menon, Abhinav, Shrivastava, Manish, Krueger, David, Lubana, Ekdeep Singh |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2410.11767 |
Similar Items
How Do LLMs Persuade? Linear Probes Can Uncover Persuasion Dynamics in Multi-Turn Conversations
by: Jaipersaud, Brandon, et al.
Published: (2025)
A Percolation Model of Emergence: Analyzing Transformers Trained on a Formal Language
by: Lubana, Ekdeep Singh, et al.
Published: (2024)
Compositional Abilities Emerge Multiplicatively: Exploring Diffusion Models on a Synthetic Task
by: Okawa, Maya, et al.
Published: (2023)
Abrupt Learning in Transformers: A Case Study on Matrix Completion
by: Gopalani, Pulkit, et al.
Published: (2024)
Detecting High-Stakes Interactions with Activation Probes
by: McKenzie, Alex, et al.
Published: (2025)
Competition Dynamics Shape Algorithmic Phases of In-Context Learning
by: Park, Core Francisco, et al.
Published: (2024)
Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks
by: Jain, Samyak, et al.
Published: (2023)
From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit
by: Costa, Valérie, et al.
Published: (2025)
Evaluating Sparse Autoencoders: From Shallow Design to Matching Pursuit
by: Costa, Valérie, et al.
Published: (2025)
Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry
by: Hindupur, Sai Sumedh R., et al.
Published: (2025)
Compositional Capabilities of Autoregressive Transformers: A Study on Synthetic, Interpretable Tasks
by: Ramesh, Rahul, et al.
Published: (2023)
Emergence of Hidden Capabilities: Exploring Learning Dynamics in Concept Space
by: Park, Core Francisco, et al.
Published: (2024)
Representation Shattering in Transformers: A Synthetic Study with Knowledge Editing
by: Nishi, Kento, et al.
Published: (2024)
Towards Reliable Evaluation of Behavior Steering Interventions in LLMs
by: Pres, Itamar, et al.
Published: (2024)
The Impact of Off-Policy Training Data on Probe Generalisation
by: Kirch, Nathalie, et al.
Published: (2025)
Swing-by Dynamics in Concept Learning and Compositional Generalization
by: Yang, Yongyi, et al.
Published: (2024)
Features as Rewards: Scalable Supervision for Open-Ended Tasks via Interpretability
by: Prasad, Aaditya Vikram, et al.
Published: (2026)
Emergence of Hierarchical Emotion Organization in Large Language Models
by: Zhao, Bo, et al.
Published: (2025)
From Isolation to Entanglement: When Do Interpretability Methods Identify and Disentangle Known Concepts?
by: Mueller, Aaron, et al.
Published: (2025)
In-Context Learning Dynamics with Random Binary Sequences
by: Bigelow, Eric J., et al.
Published: (2023)
In-Context Learning Strategies Emerge Rationally
by: Wurgaft, Daniel, et al.
Published: (2025)
Continuous Video Process: Modeling Videos as Continuous Multi-Dimensional Processes for Video Prediction
by: Shrivastava, Gaurav, et al.
Published: (2024)
Towards an Understanding of Stepwise Inference in Transformers: A Synthetic Graph Navigation Model
by: Khona, Mikail, et al.
Published: (2024)
What Makes and Breaks Safety Fine-tuning? A Mechanistic Study
by: Jain, Samyak, et al.
Published: (2024)
Belief Dynamics Reveal the Dual Nature of In-Context Learning and Activation Steering
by: Bigelow, Eric, et al.
Published: (2025)
Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention
by: Huang, Jing, et al.
Published: (2026)
Tokenized SAEs: Disentangling SAE Reconstructions
by: Dooms, Thomas, et al.
Published: (2025)
Distribution-Aware Feature Selection for SAEs
by: Oozeer, Narmeen, et al.
Published: (2025)
The Rate-Distortion-Polysemanticity Tradeoff in SAEs
by: Mencattini, Tommaso, et al.
Published: (2026)
Resa: Transparent Reasoning Models via SAEs
by: Wang, Shangshang, et al.
Published: (2025)
ICLR: In-Context Learning of Representations
by: Park, Core Francisco, et al.
Published: (2024)
Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
by: Bigelow, Eric, et al.
Published: (2026)
Video Decomposition Prior: A Methodology to Decompose Videos into Layers
by: Shrivastava, Gaurav, et al.
Published: (2024)
Residual Stream Analysis with Multi-Layer SAEs
by: Lawson, Tim, et al.
Published: (2024)
Mechanistic Interpretability with SAEs: Probing Religion, Violence, and Geography in Large Language Models
by: Simbeck, Katharina, et al.
Published: (2025)
Utilization of Neighbor Information for Image Classification with Different Levels of Supervision
by: Jayatilaka, Gihan, et al.
Published: (2025)
Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines?
by: Korznikov, Anton, et al.
Published: (2026)
Ablating Archetypes: The Stability of Archetypal SAEs is an Artifact of Initialization and Metric Design
by: Brzozowski, Michał, et al.
Published: (2026)
Can SAEs reveal and mitigate racial biases of LLMs in healthcare?
by: Ahsan, Hiba, et al.
Published: (2025)
SAEs Are Good for Steering -- If You Select the Right Features
by: Arad, Dana, et al.
Published: (2025)