| Main Authors: | Shafran, Or, Ronen, Shaked, Fahn, Omri, Ravfogel, Shauli, Geiger, Atticus, Geva, Mor |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2602.02464 |
Similar Items
Constructing Interpretable Features from Compositional Neuron Groups
by: Shafran, Or, et al.
Published: (2025)
Mixing Mechanisms: How Language Models Retrieve Bound Entities In-Context
by: Gur-Arieh, Yoav, et al.
Published: (2025)
Intrinsic Test of Unlearning Using Parametric Knowledge Traces
by: Hong, Yihuai, et al.
Published: (2024)
RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
by: Huang, Jing, et al.
Published: (2024)
Enhancing Automated Interpretability with Output-Centric Feature Descriptions
by: Gur-Arieh, Yoav, et al.
Published: (2025)
Gumbel Counterfactual Generation From Language Models
by: Ravfogel, Shauli, et al.
Published: (2024)
BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models
by: Ben-Zaken, Elad, et al.
Published: (2021)
Log-linear Guardedness and its Implications
by: Ravfogel, Shauli, et al.
Published: (2022)
Emergence of Linear Truth Encodings in Language Models
by: Ravfogel, Shauli, et al.
Published: (2025)
Detecting (Un)answerability in Large Language Models with Linear Directions
by: Lavi, Maor Juliet, et al.
Published: (2025)
Estimating Knowledge in Large Language Models Without Generating a Single Token
by: Gottesman, Daniela, et al.
Published: (2024)
Diversity Over Quantity: A Lesson From Few Shot Relation Classification
by: Cohen, Amir DN, et al.
Published: (2024)
Geometric Factual Recall in Transformers
by: Ravfogel, Shauli, et al.
Published: (2026)
Friends and Grandmothers in Silico: Localizing Entity Cells in Language Models
by: Yona, Itay, et al.
Published: (2026)
A Practical Method for Generating String Counterfactuals
by: Avitan, Matan, et al.
Published: (2024)
RELIC: Evaluating Complex Reasoning via the Recognition of Languages In-Context
by: Petty, Jackson, et al.
Published: (2025)
Kernelized Concept Erasure
by: Ravfogel, Shauli, et al.
Published: (2022)
State over Tokens: Characterizing the Role of Reasoning Tokens
by: Levy, Mosh, et al.
Published: (2025)
Linear Adversarial Concept Erasure
by: Ravfogel, Shauli, et al.
Published: (2022)
Activation Steering via Generative Causal Mediation
by: Sankaranarayanan, Aruna, et al.
Published: (2026)
Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts
by: Ahrac, Sagi, et al.
Published: (2026)
Can Large Language Models Faithfully Express Their Intrinsic Uncertainty in Words?
by: Yona, Gal, et al.
Published: (2024)
The Role of Language Imbalance in Cross-lingual Generalisation: Insights from Cloned Language Experiments
by: Schäfer, Anton, et al.
Published: (2024)
From Loops to Oops: Fallback Behaviors of Language Models Under Uncertainty
by: Ivgi, Maor, et al.
Published: (2024)
Beyond Single Embeddings: Capturing Diverse Targets with Multi-Query Retrieval
by: Chen, Hung-Ting, et al.
Published: (2025)
The Medium Is Not the Message: Deconfounding Document Embeddings via Linear Concept Erasure
by: Fan, Yu, et al.
Published: (2025)
Pretrained LLMs Learn Multiple Types of Uncertainty
by: Cohen, Roi, et al.
Published: (2025)
Inferring Functionality of Attention Heads from their Parameters
by: Elhelo, Amit, et al.
Published: (2024)
IQ Test for LLMs: An Evaluation Framework for Uncovering Core Skills in LLMs
by: Maimon, Aviya, et al.
Published: (2025)
Performance Gap in Entity Knowledge Extraction Across Modalities in Vision Language Models
by: Cohen, Ido, et al.
Published: (2024)
Linguistic Binding in Diffusion Models: Enhancing Attribute Correspondence through Attention Map Alignment
by: Rassin, Royi, et al.
Published: (2023)
Indications of Belief-Guided Agency and Meta-Cognitive Monitoring in Large Language Models
by: Yalon, Noam Steinmetz, et al.
Published: (2026)
Backward Lens: Projecting Language Model Gradients into the Vocabulary Space
by: Katz, Shahar, et al.
Published: (2024)
Representation Surgery: Theory and Practice of Affine Steering
by: Singh, Shashwat, et al.
Published: (2024)
Description-Based Text Similarity
by: Ravfogel, Shauli, et al.
Published: (2023)
Discrete Diffusion Models Exploit Asymmetry to Solve Lookahead Planning Tasks
by: Trainin, Itamar, et al.
Published: (2026)
Hallucinations Undermine Trust; Metacognition is a Way Forward
by: Yona, Gal, et al.
Published: (2026)
Eliciting Textual Descriptions from Representations of Continuous Prompts
by: Ramati, Dana, et al.
Published: (2024)
Narrowing the Knowledge Evaluation Gap: Open-Domain Question Answering with Multi-Granularity Answers
by: Yona, Gal, et al.
Published: (2024)
Do Large Language Models Latently Perform Multi-Hop Reasoning?
by: Yang, Sohee, et al.
Published: (2024)