:: Library Catalog

Bibliographic Details
Main Authors:	Li, Maximilian; Davies, Xander; Nadeau, Max
Format:	Preprint
Published:	2023
Subjects:	Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2309.05973

Similar Items

Existing Large Language Model Unlearning Evaluations Are Inconclusive
by: Feng, Zhili, et al.
Published: (2025)

Breaking BERT: Gradient Attack on Twitter Sentiment Analysis for Targeted Misclassification
by: Subedi, Akil Raj, et al.
Published: (2025)

Attribution-Guided Pruning for Insight and Control: Circuit Discovery and Targeted Correction in Small-scale LLMs
by: Hatefi, Sayed Mohammad Vakilzadeh, et al.
Published: (2025)

Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs
by: Liu, Hongliang, et al.
Published: (2026)

PrisonBreak: Jailbreaking Large Language Models with at Most Twenty-Five Targeted Bit-flips
by: Coalson, Zachary, et al.
Published: (2024)

Scalable Data Ablation Approximations for Language Models through Modular Training and Merging
by: Na, Clara, et al.
Published: (2024)

Persona-driven Simulation of Voting Behavior in the European Parliament with Large Language Models
by: Kreutner, Maximilian, et al.
Published: (2025)

Breaking the Attention Bottleneck
by: Hilsenbek, Kalle
Published: (2024)

Judge Circuits
by: Feldhus, Nils, et al.
Published: (2026)

Cyclic Ablation: Testing Concept Localization against Functional Regeneration in AI
by: Kapelko, Eduard
Published: (2025)

Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity
by: Lochab, Anamika, et al.
Published: (2026)

Circuit Component Reuse Across Tasks in Transformer Language Models
by: Merullo, Jack, et al.
Published: (2023)

Adaptive Circuit Behavior and Generalization in Mechanistic Interpretability
by: Nainani, Jatin, et al.
Published: (2024)

Circuit Complexity Bounds for Visual Autoregressive Model
by: Ke, Yekun, et al.
Published: (2025)

Functional Component Ablation Reveals Specialization Patterns in Hybrid Language Model Architectures
by: Borobia, Hector, et al.
Published: (2026)

Breaking Training Bottlenecks: Effective and Stable Reinforcement Learning for Coding Models
by: Li, Zongqian, et al.
Published: (2026)

Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models
by: Mondorf, Philipp, et al.
Published: (2024)

Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models
by: O'Neill, Charles, et al.
Published: (2024)

Discursive Circuits: How Do Language Models Understand Discourse Relations?
by: Miao, Yisong, et al.
Published: (2025)

Breaking Symmetry When Training Transformers
by: Zuo, Chunsheng, et al.
Published: (2024)

Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models
by: Ruis, Laura, et al.
Published: (2024)

STBLLM: Breaking the 1-Bit Barrier with Structured Binary LLMs
by: Dong, Peijie, et al.
Published: (2024)

Single Character Perturbations Break LLM Alignment
by: Lin, Leon, et al.
Published: (2024)

On Mechanistic Circuits for Extractive Question-Answering
by: Basu, Samyadeep, et al.
Published: (2025)

From Associations to Activations: Comparing Behavioral and Hidden-State Semantic Geometry in LLMs
by: Schiekiera, Louis, et al.
Published: (2026)

Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety
by: Guan, Zihan, et al.
Published: (2025)

Language Models Improve When Pretraining Data Matches Target Tasks
by: Mizrahi, David, et al.
Published: (2025)

GenFighter: A Generative and Evolutive Textual Attack Removal
by: Islam, Md Athikul, et al.
Published: (2024)

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning
by: Casademunt, Helena, et al.
Published: (2025)

Transcoders Find Interpretable LLM Feature Circuits
by: Dunefsky, Jacob, et al.
Published: (2024)

AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents
by: Andriushchenko, Maksym, et al.
Published: (2024)

Foundation Model's Embedded Representations May Detect Distribution Shift
by: Vargas, Max, et al.
Published: (2023)

Does Alignment Tuning Really Break LLMs' Internal Confidence?
by: Oh, Hongseok, et al.
Published: (2024)

Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
by: Fu, Yichao, et al.
Published: (2024)

Breaking MLPerf Training: A Case Study on Optimizing BERT
by: Kim, Yongdeok, et al.
Published: (2024)

SCALPEL: Selective Capability Ablation via Low-rank Parameter Editing for Large Language Model Interpretability Analysis
by: Fu, Zihao, et al.
Published: (2026)

LLM Circuit Analyses Are Consistent Across Training and Scale
by: Tigges, Curt, et al.
Published: (2024)

Identifying a Circuit for Verb Conjugation in GPT-2
by: Africa, David Demitri
Published: (2025)

From Data to Knowledge: Evaluating How Efficiently Language Models Learn Facts
by: Christoph, Daniel, et al.
Published: (2025)

Breaking Language Barriers or Reinforcing Bias? A Study of Gender and Racial Disparities in Multilingual Contrastive Vision Language Models
by: Sahili, Zahraa Al, et al.
Published: (2025)