| Main Authors: | Lee, Andrew, Belinkov, Yonatan, Viégas, Fernanda, Wattenberg, Martin |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2602.04752 |
Similar Items
Tensor Product Representation Probes Reveal Shared Structure Across Linear Directions
by: Lee, Andrew, et al.
Published: (2026)
Shared Global and Local Geometry of Language Model Embeddings
by: Lee, Andrew, et al.
Published: (2025)
Relational Composition in Neural Networks: A Survey and Call to Action
by: Wattenberg, Martin, et al.
Published: (2024)
The Geometry of Self-Verification in a Task-Specific Reasoning Model
by: Lee, Andrew, et al.
Published: (2025)
Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner
by: Li, Kenneth, et al.
Published: (2024)
When Bad Data Leads to Good Models
by: Li, Kenneth, et al.
Published: (2025)
Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism
by: Orgad, Hadas, et al.
Published: (2026)
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
by: Li, Kenneth, et al.
Published: (2023)
SAEs Are Good for Steering -- If You Select the Right Features
by: Arad, Dana, et al.
Published: (2025)
Why Can't Transformers Learn Multiplication? Reverse-Engineering Reveals Long-Range Dependency Pitfalls
by: Bai, Xiaoyan, et al.
Published: (2025)
Chronotome: Real-Time Topic Modeling for Streaming Embedding Spaces
by: Lim, Matte, et al.
Published: (2025)
Between the Layers Lies the Truth: Uncertainty Estimation in LLMs Using Intra-Layer Local Information Scores
by: Badash, Zvi N., et al.
Published: (2026)
Accelerating the Global Aggregation of Local Explanations
by: Mor, Alon, et al.
Published: (2023)
Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task
by: Li, Kenneth, et al.
Published: (2022)
Measures of Information Reflect Memorization Patterns
by: Bansal, Rachit, et al.
Published: (2022)
Measuring and Controlling Instruction (In)Stability in Language Model Dialogs
by: Li, Kenneth, et al.
Published: (2024)
Leveraging Prototypical Representations for Mitigating Social Bias without Demographic Information
by: Iskander, Shadi, et al.
Published: (2024)
Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs
by: Itzhak, Itay, et al.
Published: (2025)
Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms
by: Hanna, Michael, et al.
Published: (2024)
Story Ribbons: Reimagining Storyline Visualizations with Large Language Models
by: Yeh, Catherine, et al.
Published: (2025)
Fast Forwarding Low-Rank Training
by: Rahamim, Adir, et al.
Published: (2024)
Decomposing Global Feature Effects Based on Feature Interactions
by: Herbinger, Julia, et al.
Published: (2023)
Decomposing Gaussians with Unknown Covariance
by: Dharamshi, Ameer, et al.
Published: (2024)
From Feelings to Metrics: Understanding and Formalizing How Users Vibe-Test LLMs
by: Itzhak, Itay, et al.
Published: (2026)
Backward Lens: Projecting Language Model Gradients into the Vocabulary Space
by: Katz, Shahar, et al.
Published: (2024)
Instructed to Bias: Instruction-Tuned Language Models Exhibit Emergent Cognitive Bias
by: Itzhak, Itay, et al.
Published: (2023)
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
by: Marks, Samuel, et al.
Published: (2024)
Silent Tokens, Loud Effects: Padding in LLMs
by: Himelstein, Rom, et al.
Published: (2025)
What Does it Mean for a Neural Network to Learn a "World Model"?
by: Li, Kenneth, et al.
Published: (2025)
Explaining Categorical Feature Interactions Using Graph Covariance and LLMs
by: Shen, Cencheng, et al.
Published: (2025)
Unified Concept Editing in Diffusion Models
by: Gandikota, Rohit, et al.
Published: (2023)
Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking
by: Prakash, Nikhil, et al.
Published: (2024)
Induction Meets Biology: Mechanisms of Repeat Detection in Protein Language Models
by: Pomerants, Gal, et al.
Published: (2026)
Position-aware Automatic Circuit Discovery
by: Haklay, Tal, et al.
Published: (2025)
ContraSim -- Analyzing Neural Representations Based on Contrastive Learning
by: Rahamim, Adir, et al.
Published: (2023)
Mechanisms of AI Protein Folding in ESMFold
by: Lu, Kevin, et al.
Published: (2026)
Does visualization help AI understand data?
by: Li, Victoria R., et al.
Published: (2025)
Confidence Regulation Neurons in Language Models
by: Stolfo, Alessandro, et al.
Published: (2024)
Q-Probe: A Lightweight Approach to Reward Maximization for Language Models
by: Li, Kenneth, et al.
Published: (2024)
Coupled Query-Key Dynamics for Attention
by: Gahtan, Barak, et al.
Published: (2026)