| Main Authors: | She, Jingyuan Selena, Potts, Christopher, Bowman, Samuel R., Geiger, Atticus |
|---|---|
| Format: | Preprint |
| Published: | 2023 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2305.19426 |
Similar Items
RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
by: Huang, Jing, et al.
Published: (2024)
ReFT: Representation Finetuning for Language Models
by: Wu, Zhengxuan, et al.
Published: (2024)
HyperSteer: Activation Steering at Scale with Hypernetworks
by: Sun, Jiuding, et al.
Published: (2025)
Constructing Interpretable Features from Compositional Neuron Groups
by: Shafran, Or, et al.
Published: (2025)
How Do Transformers Learn Variable Binding in Symbolic Programs?
by: Wu, Yiwei, et al.
Published: (2025)
Mixing Mechanisms: How Language Models Retrieve Bound Entities In-Context
by: Gur-Arieh, Yoav, et al.
Published: (2025)
pyvene: A Library for Understanding and Improving PyTorch Models via Interventions
by: Wu, Zhengxuan, et al.
Published: (2024)
How Causal Abstraction Underpins Computational Explanation
by: Geiger, Atticus, et al.
Published: (2025)
HyperDAS: Towards Automating Mechanistic Interpretability with Hypernetworks
by: Sun, Jiuding, et al.
Published: (2025)
A Reply to Makelov et al. (2023)'s "Interpretability Illusion" Arguments
by: Wu, Zhengxuan, et al.
Published: (2024)
Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations
by: Csordás, Róbert, et al.
Published: (2024)
AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders
by: Wu, Zhengxuan, et al.
Published: (2025)
Interpretability at Scale: Identifying Causal Mechanisms in Alpaca
by: Wu, Zhengxuan, et al.
Published: (2023)
Simple Mechanistic Explanations for Out-Of-Context Reasoning
by: Wang, Atticus, et al.
Published: (2025)
Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought
by: Boppana, Siddharth, et al.
Published: (2026)
Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
by: Bigelow, Eric, et al.
Published: (2026)
Fine-Tuning and Prompt Optimization: Two Great Steps that Work Better Together
by: Soylu, Dilara, et al.
Published: (2024)
Updating CLIP to Prefer Descriptions Over Captions
by: Zur, Amir, et al.
Published: (2024)
Activation Steering via Generative Causal Mediation
by: Sankaranarayanan, Aruna, et al.
Published: (2026)
In-Context Learning and Fine-Tuning GPT for Argument Mining
by: Cabessa, Jérémie, et al.
Published: (2024)
Demystifying Verbatim Memorization in Large Language Models
by: Huang, Jing, et al.
Published: (2024)
Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small
by: Chaudhary, Maheep, et al.
Published: (2024)
Reasoning Towards Fairness: Mitigating Bias in Language Models through Reasoning-Guided Fine-Tuning
by: Kabra, Sanchit, et al.
Published: (2025)
In-Context Fine-Tuning for Time-Series Foundation Models
by: Das, Abhimanyu, et al.
Published: (2024)
Fine-Tuning Language Models with Reward Learning on Policy
by: Lang, Hao, et al.
Published: (2024)
From Directions to Regions: Decomposing Activations in Language Models via Local Geometry
by: Shafran, Or, et al.
Published: (2026)
Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models
by: Lee, Chungpa, et al.
Published: (2026)
Steering Without Side Effects: Improving Post-Deployment Control of Language Models
by: Stickland, Asa Cooper, et al.
Published: (2024)
Reasoning Planning for Language Models
by: Nguyen, Bao, et al.
Published: (2025)
Deeper Insights Without Updates: The Power of In-Context Learning Over Fine-Tuning
by: Yin, Qingyu, et al.
Published: (2024)
DebateBench: A Challenging Long Context Reasoning Benchmark For Large Language Models
by: Tiwari, Utkarsh, et al.
Published: (2025)
Enhancing Event Reasoning in Large Language Models through Instruction Fine-Tuning with Semantic Causal Graphs
by: Bethany, Mazal, et al.
Published: (2024)
Fine-Tuning Language Models with Just Forward Passes
by: Malladi, Sadhika, et al.
Published: (2023)
Dissecting Fine-Tuning Unlearning in Large Language Models
by: Hong, Yihuai, et al.
Published: (2024)
LUNE: Efficient LLM Unlearning via LoRA Fine-Tuning with Negative Examples
by: Liu, Yezi, et al.
Published: (2025)
Combining Causal Models for More Accurate Abstractions of Neural Networks
by: Pîslar, Theodora-Mara, et al.
Published: (2025)
Accidental Vulnerability: Factors in Fine-Tuning that Shift Model Safeguards
by: Pandey, Punya Syon, et al.
Published: (2025)
Enhancing Causal Reasoning in Large Language Models: A Causal Attribution Model for Precision Fine-Tuning
by: Cai, Hengrui, et al.
Published: (2023)
Personalized Collaborative Fine-Tuning for On-Device Large Language Models
by: Wagner, Nicolas, et al.
Published: (2024)
How Multilingual Are Large Language Models Fine-Tuned for Translation?
by: Richburg, Aquia, et al.
Published: (2024)