Staff View :: Library Catalog

Bibliographic Details
Main Authors:	Li, Xiaodan; Wu, Mengjie; Zhu, Yao; Lv, Yunna; Chen, YueFeng; Chen, Cen; Guo, Jianmei; Xue, Hui
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2510.09694

_version_	1866914087443628032
author	Li, Xiaodan; Wu, Mengjie; Zhu, Yao; Lv, Yunna; Chen, YueFeng; Chen, Cen; Guo, Jianmei; Xue, Hui
author_facet	Li, Xiaodan; Wu, Mengjie; Zhu, Yao; Lv, Yunna; Chen, YueFeng; Chen, Cen; Guo, Jianmei; Xue, Hui
contents	Large models (LMs) are powerful content generators, yet their open-ended nature can also introduce risks, such as generating harmful or biased content. Existing guardrails mostly perform post-hoc detection, which may expose unsafe content before it is caught, and latency constraints further push them toward lightweight models, limiting detection accuracy. In this work, we propose Kelp, a novel plug-in framework that enables streaming risk detection within the LM generation pipeline. Kelp leverages intermediate LM hidden states through a Streaming Latent Dynamics Head (SLD), which models the temporal evolution of risk across the generated sequence for more accurate real-time risk detection. To ensure reliable streaming moderation in real applications, we introduce an Anchored Temporal Consistency (ATC) loss that enforces monotonic harm predictions by embedding a benign-then-harmful temporal prior. In addition, for a rigorous evaluation of streaming guardrails, we present StreamGuardBench, a model-grounded benchmark featuring on-the-fly responses from each protected model, reflecting real-world streaming scenarios in both text and vision-language tasks. Across diverse models and datasets, Kelp consistently outperforms state-of-the-art post-hoc guardrails and prior plug-in probes (15.61% higher average F1), while using only 20M parameters and adding less than 0.5 ms of per-token latency.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_09694
institution	arXiv
publishDate	2025
record_format	arxiv
title	Kelp: A Streaming Safeguard for Large Models via Latent Dynamics-Guided Risk Detection
topic	Machine Learning Artificial Intelligence
url	https://arxiv.org/abs/2510.09694
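
The abstract's notion of monotonic harm predictions over a streamed response can be illustrated with a toy sketch. This is not the Kelp implementation and not the ATC loss from the paper, whose exact form the record does not give. The function names, the hinge-style penalty, and the 0.5 threshold are all assumptions chosen for illustration. The sketch takes a list of per-token risk scores, measures how far the sequence departs from being non-decreasing, and reports the first token at which a streaming monitor would flag the output.

```python
from typing import List, Optional


def monotonicity_penalty(scores: List[float]) -> float:
    """Sum of drops between consecutive per-token risk scores.

    Hypothetical stand-in for a temporal-consistency term: it is 0.0 when
    the scores never decrease and grows with every downward step.
    """
    return sum(max(0.0, prev - cur) for prev, cur in zip(scores, scores[1:]))


def first_flagged_token(scores: List[float], threshold: float = 0.5) -> Optional[int]:
    """Index of the first token whose risk score reaches the threshold.

    Returns None if no token is flagged. The threshold value is an
    assumption, not a figure taken from the paper.
    """
    for index, score in enumerate(scores):
        if score >= threshold:
            return index
    return None


if __name__ == "__main__":
    # A benign-then-harmful trajectory: scores only rise, so no penalty.
    rising = [0.1, 0.2, 0.9]
    # A trajectory that dips after rising: the dip is penalized.
    dipping = [0.1, 0.6, 0.4, 0.9]
    print(monotonicity_penalty(rising), first_flagged_token(rising))
    print(monotonicity_penalty(dipping), first_flagged_token(dipping))
```

In this toy setting, a score sequence that dips after crossing the threshold would let a monitor flag a token and then appear to retract the flag, which is the kind of inconsistency a monotonicity constraint is meant to discourage.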
