Staff View: :: Library Catalog

Bibliographic Details
Main Authors:	Cui, Christopher Z.; Killian, Taylor W.; Ammanabrolu, Prithviraj
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2605.07021

_version_	1866911699619020800
author	Cui, Christopher Z.; Killian, Taylor W.; Ammanabrolu, Prithviraj
author_facet	Cui, Christopher Z.; Killian, Taylor W.; Ammanabrolu, Prithviraj
contents	Reasoning in Large Language Models (LLMs) poses a challenge for oversight as many misaligned behaviors do not surface until reasoning concludes. To address this, we introduce Behavior Cue Reasoning for making LLM reasoning more controllable and monitorable. Behavior Cues are special token sequences that a model is trained to emit immediately before specific implicit and explicit behaviors, acting as dual-purpose signal and control levers. When fine-tuning a weaker external monitor with Reinforcement Learning for reasoning oversight, a compressed view of only information surfaced by Behavior Cues is sufficient signal for the monitor to prune up to 50% of otherwise wasted reasoning tokens in complex math problem solving. When leveraged by an almost optimal rule-based monitor in an environment where excessive constraint violations result in failure, Behavior Cues allow for the recovery of safe actions from 80% of reasoning traces that would otherwise end with the proposal of an unsafe action, more than doubling the success rate from 46% to 96%. Through evaluation across two model families and three domains, we show that Behavior Cue Reasoning improves reasoning monitorability and controllability with no cost to performance. More broadly, our work progresses scalable oversight by demonstrating how the monitored model itself can be trained to reason more tractably to oversight. Code: https://github.com/christopherzc/behavior-cues
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_07021
institution	arXiv
publishDate	2026
record_format	arxiv
title	Behavior Cue Reasoning: Monitorable Reasoning Improves Efficiency and Safety through Oversight
topic	Artificial Intelligence
url	https://arxiv.org/abs/2605.07021
