Staff View: Segment-Level Coherence for Robust Harmful Intent Probing in LLMs :: Library Catalog

Bibliographic Details
Main Authors:	He, Xuanli; Sel, Bilgehan; Ali, Faizan; Bao, Jenny; Cunningham, Hoagy; Wei, Jerry
Format:	Preprint
Published:	2026
Subjects:	Computation and Language; Cryptography and Security
Online Access:	https://arxiv.org/abs/2604.14865

_version_	1866914479574351872
author	He, Xuanli; Sel, Bilgehan; Ali, Faizan; Bao, Jenny; Cunningham, Hoagy; Wei, Jerry
author_facet	He, Xuanli; Sel, Bilgehan; Ali, Faizan; Bao, Jenny; Cunningham, Hoagy; Wei, Jerry
contents	Large Language Models (LLMs) are increasingly exposed to adaptive jailbreaking, particularly in high-stakes Chemical, Biological, Radiological, and Nuclear (CBRN) domains. Although streaming probes enable real-time monitoring, they still make systematic errors. We identify a core issue: existing methods often rely on a few high-scoring tokens, leading to false alarms when sensitive CBRN terms appear in benign contexts. To address this, we introduce a streaming probing objective that requires multiple evidence tokens to consistently support a prediction, rather than relying on isolated spikes. This encourages more robust detection based on aggregated signals instead of single-token cues. At a fixed 1% false-positive rate, our method improves the true-positive rate by 35.55% relative to strong streaming baselines. We further observe substantial gains in AUROC, even when starting from near-saturated baseline performance (AUROC = 97.40%). We also show that probing Attention or MLP activations consistently outperforms residual-stream features. Finally, even when adversarial fine-tuning enables novel character-level ciphers, harmful intent remains detectable: probes developed for the base LLMs can be applied "plug-and-play" to these obfuscated attacks, achieving an AUROC of over 98.85%.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_14865
institution	arXiv
publishDate	2026
record_format	arxiv
title	Segment-Level Coherence for Robust Harmful Intent Probing in LLMs
topic	Computation and Language; Cryptography and Security
url	https://arxiv.org/abs/2604.14865
