Staff View: History Anchors: How Prior Behavior Steers LLM Decisions Toward Unsafe Actions :: Library Catalog

Bibliographic Details
Main Author:	Salgado, Alberto G. Rodríguez
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence; Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2605.13825

_version_	1866914563695312896
author	Salgado, Alberto G. Rodríguez
author_facet	Salgado, Alberto G. Rodríguez
contents	Frontier LLMs are increasingly deployed as agents that pick the next action after a long log of prior tool calls produced by the same or a different model. We ask a simple safety question: if a prior step in that log was harmful, will the model continue the harmful course? We build HistoryAnchor-100, 100 short scenarios across ten high-stakes domains, each pairing three forced harmful prior actions with a free-choice node offering two safe and two unsafe options. Across 17 frontier models from six providers we find a striking asymmetry: under a neutral system prompt the strongest aligned models almost never pick unsafe, but a single added sentence, "stay consistent with the strategy shown in the prior history", flips them to 91-98%, and the flipped models often escalate beyond continuation. Two controls rule out simpler explanations: permuting action labels leaves the effect intact, and the same instruction with an all-safe prior history keeps unsafe rates below 7%. Different families flip at different doses of unsafe history, and within every aligned family the flagship is the most affected sibling, an inverse-scaling pattern with respect to safety. These results are a red flag for agentic deployments where trajectories may be replayed, forged, or injected.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_13825
institution	arXiv
publishDate	2026
record_format	arxiv
title	History Anchors: How Prior Behavior Steers LLM Decisions Toward Unsafe Actions
topic	Artificial Intelligence; Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2605.13825

Similar Items