Bibliographic Details
Main Authors: Liu, Wei, Xu, Haomei, Liu, Bingqing, Deng, Zhiying, Wang, Haozhao, Wang, Jun, Li, Ruixuan, Teh, Yee Whye, Lee, Wee Sun
Format: Preprint
Published: 2025
Subjects: Artificial Intelligence
Online Access: https://arxiv.org/abs/2510.00625
author Liu, Wei
Xu, Haomei
Liu, Bingqing
Deng, Zhiying
Wang, Haozhao
Wang, Jun
Li, Ruixuan
Teh, Yee Whye
Lee, Wee Sun
contents Large language models (LLMs) inevitably encode outdated or incorrect knowledge. Updating, deleting, and forgetting such knowledge is important for alignment, safety, and other issues. To address this, model editing has emerged as a promising paradigm: it precisely edits a small subset of parameters so that a specific fact is updated while other knowledge is preserved. Despite the great success reported in previous papers, we find that the apparent reliability of editing rests on a fragile foundation and that the current literature is largely driven by illusory success. The fundamental goal of steering the model's output toward a target with minimal modification would encourage the exploitation of hidden shortcuts rather than the use of real semantics. This problem challenges the feasibility of the current model editing literature at its very foundation, as shortcuts are inherently at odds with robust knowledge integration. Moreover, this issue has long been obscured by evaluation frameworks that lack negative examples. To uncover it, we systematically develop a suite of new evaluation methods. Strikingly, we find that state-of-the-art approaches collapse even under the simplest negation queries. Our empirical evidence shows that editing is likely based on shortcuts rather than full semantics, calling for an urgent reconsideration of the very basis of model editing before further advancements can be meaningfully pursued.
format Preprint
id arxiv_https___arxiv_org_abs_2510_00625
institution arXiv
publishDate 2025
record_format arxiv
title Is Model Editing Built on Sand? Revealing Its Illusory Success and Fragile Foundation
topic Artificial Intelligence
url https://arxiv.org/abs/2510.00625