| Main Authors: | Bai, Hao; Ma, Yi |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2410.16443 |
Similar Items
Neuron Patching: Semantic-based Neuron-level Language Model Repair for Code Generation
by: Gu, Jian, et al.
Published: (2023)
Instruction Learning Paradigms: A Dual Perspective on White-box and Black-box LLMs
by: Ren, Yanwei, et al.
Published: (2025)
Constructing Interpretable Features from Compositional Neuron Groups
by: Shafran, Or, et al.
Published: (2025)
NeuronScope: A Multi-Agent Framework for Explaining Polysemantic Neurons in Language Models
by: Liu, Weiqi, et al.
Published: (2026)
Selective Neuron Amplification in Transformer Language Models
by: Akhtar, Ryyan, et al.
Published: (2026)
Neuron-Level Knowledge Attribution in Large Language Models
by: Yu, Zeping, et al.
Published: (2023)
Online Personalizing White-box LLMs Generation with Neural Bandits
by: Chen, Zekai, et al.
Published: (2024)
Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models
by: Lin, Zhen, et al.
Published: (2023)
Confidence Regulation Neurons in Language Models
by: Stolfo, Alessandro, et al.
Published: (2024)
Detecting and Pruning Prominent but Detrimental Neurons in Large Language Models
by: Ali, Ameen, et al.
Published: (2025)
Variational Language Concepts for Interpreting Foundation Language Models
by: Wang, Hengyi, et al.
Published: (2024)
Smaller Language Models are Better Black-box Machine-Generated Text Detectors
by: Mireshghallah, Niloofar, et al.
Published: (2023)
GLUScope: A Tool for Analyzing GLU Neurons in Transformer Language Models
by: Gerstner, Sebastian, et al.
Published: (2026)
Knowledge Editing on Black-box Large Language Models
by: Song, Xiaoshuai, et al.
Published: (2024)
Universal Neurons in GPT2 Language Models
by: Gurnee, Wes, et al.
Published: (2024)
White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?
by: Yu, Yaodong, et al.
Published: (2023)
Squeeze the Soaked Sponge: Efficient Off-policy Reinforcement Finetuning for Large Language Model
by: Liang, Jing, et al.
Published: (2025)
Unveiling the Influence of Amplifying Language-Specific Neurons
by: Rahmanisa, Inaya, et al.
Published: (2025)
Did the Neurons Read your Book? Document-level Membership Inference for Large Language Models
by: Meeus, Matthieu, et al.
Published: (2023)
Finding Culture-Sensitive Neurons in Vision-Language Models
by: Zhao, Xiutian, et al.
Published: (2025)
DALD: Improving Logits-based Detector without Logits from Black-box LLMs
by: Zeng, Cong, et al.
Published: (2024)
Crafting Large Language Models for Enhanced Interpretability
by: Sun, Chung-En, et al.
Published: (2024)
Merlin's Whisper: Enabling Efficient Reasoning in Large Language Models via Black-box Persuasive Prompting
by: Xia, Heming, et al.
Published: (2025)
Automatically Interpreting Millions of Features in Large Language Models
by: Paulo, Gonçalo, et al.
Published: (2024)
A Review of Developmental Interpretability in Large Language Models
by: Kendiukhov, Ihor
Published: (2025)
Unlocking the Black Box of Latent Reasoning: An Interpretability-Guided Approach to Intervention
by: Chang, Shuochen, et al.
Published: (2026)
Interpreting Neurons in Deep Vision Networks with Language Models
by: Bai, Nicholas, et al.
Published: (2024)
Improving Variable-Length Generation in Diffusion Language Models via Length Regularization
by: Cheng, Zicong, et al.
Published: (2026)
Found in the Middle: Permutation Self-Consistency Improves Listwise Ranking in Large Language Models
by: Tang, Raphael, et al.
Published: (2023)
Learnable Privacy Neurons Localization in Language Models
by: Chen, Ruizhe, et al.
Published: (2024)
Cross-Model Comparative Loss for Enhancing Neuronal Utility in Language Understanding
by: Zhu, Yunchang, et al.
Published: (2023)
RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
by: Huang, Jing, et al.
Published: (2024)
Analyze Feature Flow to Enhance Interpretation and Steering in Language Models
by: Laptev, Daniil, et al.
Published: (2025)
Improving Sample Efficiency of Reinforcement Learning with Background Knowledge from Large Language Models
by: Zhang, Fuxiang, et al.
Published: (2024)
Kronecker Factorization Improves Efficiency and Interpretability of Sparse Autoencoders
by: Kurochkin, Vadim, et al.
Published: (2025)
Repetition Improves Language Model Embeddings
by: Springer, Jacob Mitchell, et al.
Published: (2024)
Improving Detection of Watermarked Language Models
by: Bahri, Dara, et al.
Published: (2025)
Discrete Prompt Tuning via Recursive Utilization of Black-box Multimodal Large Language Model for Personalized Visual Emotion Recognition
by: Takahashi, Ryo, et al.
Published: (2025)
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability
by: Karvonen, Adam, et al.
Published: (2025)
How Attention Sinks Emerge in Large Language Models: An Interpretability Perspective
by: Peng, Runyu, et al.
Published: (2026)