Bibliographic Details
Main Authors: Pope, Quintin, Balaji, Ajay Hayagreeve, Thibodeau, Jacques, Fern, Xiaoli
Format: Preprint
Published: 2026
Subjects:
Online Access: https://arxiv.org/abs/2605.05090
author Pope, Quintin
Balaji, Ajay Hayagreeve
Thibodeau, Jacques
Fern, Xiaoli
contents We present an automated, contrastive evaluation pipeline for auditing the behavioral impact of interventions on large language models. Given a base model $M_1$ and an intervention model $M_2$, our method compares their free-form, multi-token generations across aligned prompt contexts and produces human-readable, statistically validated natural-language hypotheses describing how the models differ, along with recurring themes that summarize patterns across validated hypotheses. We evaluate the approach in a synthetic setting by injecting known behavioral changes and showing that the pipeline reliably recovers them. We then apply it to three real-world interventions (reasoning distillation, knowledge editing, and unlearning), demonstrating that the method surfaces both intended and unexpected behavioral shifts, distinguishes large from subtle interventions, and does not hallucinate differences when effects are absent or misaligned with the prompt bank. Overall, the pipeline provides a statistically grounded and interpretable tool for post-hoc auditing of intervention-induced changes in model behavior.
format Preprint
id arxiv_https___arxiv_org_abs_2605_05090
institution arXiv
publishDate 2026
record_format arxiv
title Automatically Finding and Validating Unexpected Side-Effects of Interventions on Language Models
topic Computation and Language
Artificial Intelligence
url https://arxiv.org/abs/2605.05090