Bibliographic Details
Main authors: Li, Xiaodan; Wu, Mengjie; Zhu, Yao; Lv, Yunna; Chen, YueFeng; Chen, Cen; Guo, Jianmei; Xue, Hui
Format: Preprint
Published: 2025
Subjects: Machine Learning; Artificial Intelligence
Online access: https://arxiv.org/abs/2510.09694
_version_ 1866914087443628032
author Li, Xiaodan
Wu, Mengjie
Zhu, Yao
Lv, Yunna
Chen, YueFeng
Chen, Cen
Guo, Jianmei
Xue, Hui
contents Large models (LMs) are powerful content generators, yet their open-ended nature also introduces risks, such as generating harmful or biased content. Existing guardrails mostly perform post-hoc detection, which may expose unsafe content before it is caught, and latency constraints push them toward lightweight models, limiting detection accuracy. In this work, we propose Kelp, a novel plug-in framework that enables streaming risk detection within the LM generation pipeline. Kelp leverages intermediate LM hidden states through a Streaming Latent Dynamics Head (SLD), which models the temporal evolution of risk across the generated sequence for more accurate real-time risk detection. To ensure reliable streaming moderation in real applications, we introduce an Anchored Temporal Consistency (ATC) loss that enforces monotonic harm predictions by embedding a benign-then-harmful temporal prior. In addition, for a rigorous evaluation of streaming guardrails, we present StreamGuardBench, a model-grounded benchmark featuring on-the-fly responses from each protected model, reflecting real-world streaming scenarios in both text and vision-language tasks. Across diverse models and datasets, Kelp consistently outperforms state-of-the-art post-hoc guardrails and prior plug-in probes (15.61% higher average F1), while using only 20M parameters and adding less than 0.5 ms of per-token latency.
format Preprint
id arxiv_https___arxiv_org_abs_2510_09694
institution arXiv
publishDate 2025
record_format arxiv
title Kelp: A Streaming Safeguard for Large Models via Latent Dynamics-Guided Risk Detection
topic Machine Learning
Artificial Intelligence
url https://arxiv.org/abs/2510.09694
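
The abstract in this record describes two ideas that a small example can make concrete: scoring risk token by token from hidden states during generation, and penalizing risk scores that fall over time (the benign-then-harmful prior). The following is a minimal illustrative sketch, not the paper's implementation. The class StreamingRiskHead, the functions monotonic_penalty and moderate_stream, the exponential moving average used as the running state, and the hinge-style penalty are all assumptions introduced here for illustration; the actual SLD head and ATC loss are defined in the preprint.

```python
import math
from typing import List, Optional, Sequence, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class StreamingRiskHead:
    """Hypothetical stand-in for a streaming head over per-token hidden states.

    Assumption: the running state is an exponential moving average of the
    hidden vectors, followed by a linear layer and a sigmoid.
    """

    def __init__(self, weights: Sequence[float], bias: float = 0.0,
                 decay: float = 0.5) -> None:
        self.weights = list(weights)
        self.bias = bias
        self.decay = decay
        self.state = [0.0] * len(self.weights)

    def step(self, hidden: Sequence[float]) -> float:
        # Blend the new hidden vector into the running state.
        self.state = [self.decay * s + (1.0 - self.decay) * h
                      for s, h in zip(self.state, hidden)]
        logit = sum(w * s for w, s in zip(self.weights, self.state)) + self.bias
        return sigmoid(logit)


def monotonic_penalty(scores: Sequence[float]) -> float:
    """Sum of decreases between consecutive risk scores.

    Assumption: a hinge on (previous - current) is one simple way to
    discourage risk scores from dropping once they have risen.
    """
    return sum(max(0.0, prev - cur) for prev, cur in zip(scores, scores[1:]))


def moderate_stream(head: StreamingRiskHead,
                    hidden_states: Sequence[Sequence[float]],
                    threshold: float = 0.5) -> Tuple[List[float], Optional[int]]:
    """Score each token as it arrives; stop at the first score >= threshold."""
    scores: List[float] = []
    for index, hidden in enumerate(hidden_states):
        score = head.step(hidden)
        scores.append(score)
        if score >= threshold:
            return scores, index
    return scores, None


if __name__ == "__main__":
    head = StreamingRiskHead(weights=[1.0, 0.0], bias=-1.0, decay=0.5)
    stream = [[0.0, 0.0], [0.0, 0.0], [4.0, 0.0], [4.0, 0.0]]
    scores, stop_index = moderate_stream(head, stream)
    print("scores:", scores)
    print("stopped at token index:", stop_index)
    print("monotonic penalty:", monotonic_penalty(scores))
```

In this toy setup, generation would be halted at the first token whose score crosses the threshold, so later tokens are never emitted. A trained head would replace the hand-set weights, and the penalty would be added to a classification loss during training.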