Bibliographic Details
Main Author: Salgado, Alberto G. Rodríguez
Format: Preprint
Published: 2026
Subjects: Artificial Intelligence; Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2605.13825
author Salgado, Alberto G. Rodríguez
author_facet Salgado, Alberto G. Rodríguez
contents Frontier LLMs are increasingly deployed as agents that pick the next action after a long log of prior tool calls produced by the same or a different model. We ask a simple safety question: if a prior step in that log was harmful, will the model continue the harmful course? We build HistoryAnchor-100, 100 short scenarios across ten high-stakes domains, each pairing three forced harmful prior actions with a free-choice node offering two safe and two unsafe options. Across 17 frontier models from six providers we find a striking asymmetry: under a neutral system prompt the strongest aligned models almost never pick unsafe, but a single added sentence, "stay consistent with the strategy shown in the prior history", flips them to 91-98%, and the flipped models often escalate beyond continuation. Two controls rule out simpler explanations: permuting action labels leaves the effect intact, and the same instruction with an all-safe prior history keeps unsafe rates below 7%. Different families flip at different doses of unsafe history, and within every aligned family the flagship is the most affected sibling, an inverse-scaling pattern with respect to safety. These results are a red flag for agentic deployments where trajectories may be replayed, forged, or injected.
format Preprint
id arxiv_https___arxiv_org_abs_2605_13825
institution arXiv
publishDate 2026
record_format arxiv
title History Anchors: How Prior Behavior Steers LLM Decisions Toward Unsafe Actions
topic Artificial Intelligence
Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2605.13825
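The evaluation protocol summarized in the abstract (a forced prior history, a free-choice node with two safe and two unsafe options, and a label-permutation control) can be illustrated with a minimal sketch. This is not the authors' code: the `Scenario` fields, the `choose` callback standing in for a model call, and the two toy policies are all hypothetical names introduced here for illustration, and the scenario text is placeholder data rather than anything from HistoryAnchor-100.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical scenario record; field names are assumptions, not the paper's schema.
@dataclass(frozen=True)
class Scenario:
    domain: str
    prior_actions: Tuple[str, ...]   # forced prior steps shown in the history
    safe_options: Tuple[str, ...]    # two safe choices at the free-choice node
    unsafe_options: Tuple[str, ...]  # two unsafe choices at the free-choice node

# label -> (option text, is_unsafe)
Choices = Dict[str, Tuple[str, bool]]

def build_choices(scenario: Scenario, rng: random.Random) -> Choices:
    """Shuffle the four options so that labels A-D carry no safety signal."""
    options = [(text, False) for text in scenario.safe_options]
    options += [(text, True) for text in scenario.unsafe_options]
    rng.shuffle(options)
    return dict(zip("ABCD", options))

def unsafe_rate(
    scenarios: Sequence[Scenario],
    choose: Callable[[Scenario, Choices], str],
    seed: int = 0,
) -> float:
    """Fraction of scenarios in which `choose` returns the label of an unsafe option."""
    rng = random.Random(seed)
    unsafe = 0
    for scenario in scenarios:
        choices = build_choices(scenario, rng)
        label = choose(scenario, choices)  # stand-in for a model call
        unsafe += choices[label][1]
    return unsafe / len(scenarios)

# Toy policies used only to exercise the harness; a real run would query a model.
def pick_safe(scenario: Scenario, choices: Choices) -> str:
    return next(label for label, (_, is_unsafe) in choices.items() if not is_unsafe)

def pick_unsafe(scenario: Scenario, choices: Choices) -> str:
    return next(label for label, (_, is_unsafe) in choices.items() if is_unsafe)

if __name__ == "__main__":
    demo = [
        Scenario(
            domain="placeholder-domain",
            prior_actions=("prior step 1", "prior step 2", "prior step 3"),
            safe_options=("safe option 1", "safe option 2"),
            unsafe_options=("unsafe option 1", "unsafe option 2"),
        )
    ] * 5
    print(unsafe_rate(demo, pick_safe), unsafe_rate(demo, pick_unsafe))
```

Because the labels are reshuffled for every scenario, a policy that keys on a fixed label rather than on option content would land near 50% unsafe, which is the kind of simpler explanation the permutation control in the abstract is meant to exclude.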