| Main Authors: | Li, Maximilian, Davies, Xander, Nadeau, Max |
|---|---|
| Format: | Preprint |
| Published: | 2023 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2309.05973 |
Similar Items
Existing Large Language Model Unlearning Evaluations Are Inconclusive
by: Feng, Zhili, et al.
Published: (2025)
Breaking BERT: Gradient Attack on Twitter Sentiment Analysis for Targeted Misclassification
by: Subedi, Akil Raj, et al.
Published: (2025)
Attribution-Guided Pruning for Insight and Control: Circuit Discovery and Targeted Correction in Small-scale LLMs
by: Hatefi, Sayed Mohammad Vakilzadeh, et al.
Published: (2025)
Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs
by: Liu, Hongliang, et al.
Published: (2026)
PrisonBreak: Jailbreaking Large Language Models with at Most Twenty-Five Targeted Bit-flips
by: Coalson, Zachary, et al.
Published: (2024)
Scalable Data Ablation Approximations for Language Models through Modular Training and Merging
by: Na, Clara, et al.
Published: (2024)
Persona-driven Simulation of Voting Behavior in the European Parliament with Large Language Models
by: Kreutner, Maximilian, et al.
Published: (2025)
Breaking the Attention Bottleneck
by: Hilsenbek, Kalle
Published: (2024)
Judge Circuits
by: Feldhus, Nils, et al.
Published: (2026)
Cyclic Ablation: Testing Concept Localization against Functional Regeneration in AI
by: Kapelko, Eduard
Published: (2025)
Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity
by: Lochab, Anamika, et al.
Published: (2026)
Circuit Component Reuse Across Tasks in Transformer Language Models
by: Merullo, Jack, et al.
Published: (2023)
Adaptive Circuit Behavior and Generalization in Mechanistic Interpretability
by: Nainani, Jatin, et al.
Published: (2024)
Circuit Complexity Bounds for Visual Autoregressive Model
by: Ke, Yekun, et al.
Published: (2025)
Functional Component Ablation Reveals Specialization Patterns in Hybrid Language Model Architectures
by: Borobia, Hector, et al.
Published: (2026)
Breaking Training Bottlenecks: Effective and Stable Reinforcement Learning for Coding Models
by: Li, Zongqian, et al.
Published: (2026)
Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models
by: Mondorf, Philipp, et al.
Published: (2024)
Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models
by: O'Neill, Charles, et al.
Published: (2024)
Discursive Circuits: How Do Language Models Understand Discourse Relations?
by: Miao, Yisong, et al.
Published: (2025)
Breaking Symmetry When Training Transformers
by: Zuo, Chunsheng, et al.
Published: (2024)
Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models
by: Ruis, Laura, et al.
Published: (2024)
STBLLM: Breaking the 1-Bit Barrier with Structured Binary LLMs
by: Dong, Peijie, et al.
Published: (2024)
Single Character Perturbations Break LLM Alignment
by: Lin, Leon, et al.
Published: (2024)
On Mechanistic Circuits for Extractive Question-Answering
by: Basu, Samyadeep, et al.
Published: (2025)
From Associations to Activations: Comparing Behavioral and Hidden-State Semantic Geometry in LLMs
by: Schiekiera, Louis, et al.
Published: (2026)
Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety
by: Guan, Zihan, et al.
Published: (2025)
Language Models Improve When Pretraining Data Matches Target Tasks
by: Mizrahi, David, et al.
Published: (2025)
GenFighter: A Generative and Evolutive Textual Attack Removal
by: Islam, Md Athikul, et al.
Published: (2024)
Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning
by: Casademunt, Helena, et al.
Published: (2025)
Transcoders Find Interpretable LLM Feature Circuits
by: Dunefsky, Jacob, et al.
Published: (2024)
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents
by: Andriushchenko, Maksym, et al.
Published: (2024)
Foundation Model's Embedded Representations May Detect Distribution Shift
by: Vargas, Max, et al.
Published: (2023)
Does Alignment Tuning Really Break LLMs' Internal Confidence?
by: Oh, Hongseok, et al.
Published: (2024)
Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
by: Fu, Yichao, et al.
Published: (2024)
Breaking MLPerf Training: A Case Study on Optimizing BERT
by: Kim, Yongdeok, et al.
Published: (2024)
SCALPEL: Selective Capability Ablation via Low-rank Parameter Editing for Large Language Model Interpretability Analysis
by: Fu, Zihao, et al.
Published: (2026)
LLM Circuit Analyses Are Consistent Across Training and Scale
by: Tigges, Curt, et al.
Published: (2024)
Identifying a Circuit for Verb Conjugation in GPT-2
by: Africa, David Demitri
Published: (2025)
From Data to Knowledge: Evaluating How Efficiently Language Models Learn Facts
by: Christoph, Daniel, et al.
Published: (2025)
Breaking Language Barriers or Reinforcing Bias? A Study of Gender and Racial Disparities in Multilingual Contrastive Vision Language Models
by: Sahili, Zahraa Al, et al.
Published: (2025)