Staff View: MindGuard: Guardrail Classifiers for Multi-Turn Mental Health Support :: Library Catalog

Bibliographic Details
Main Authors:	Farinhas, António; Guerreiro, Nuno M.; Pombal, José; Martins, Pedro Henrique; Melton, Laura; Conway, Alex; Dochat, Cara; D'Eon, Maya; Rei, Ricardo
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2602.00950

_version_	1866910007613718528
author	Farinhas, António; Guerreiro, Nuno M.; Pombal, José; Martins, Pedro Henrique; Melton, Laura; Conway, Alex; Dochat, Cara; D'Eon, Maya; Rei, Ricardo
author_facet	Farinhas, António; Guerreiro, Nuno M.; Pombal, José; Martins, Pedro Henrique; Melton, Laura; Conway, Alex; Dochat, Cara; D'Eon, Maya; Rei, Ricardo
contents	Large language models are increasingly used for mental health support, yet their conversational coherence alone does not ensure clinical appropriateness. Existing general-purpose safeguards often fail to distinguish between therapeutic disclosures and genuine clinical crises, leading to safety failures. To address this gap, we introduce a clinically grounded risk taxonomy, developed in collaboration with PhD-level psychologists, that identifies actionable harm (e.g., self-harm and harm to others) while preserving space for safe, non-crisis therapeutic content. We release MindGuard-testset, a dataset of real-world multi-turn conversations annotated at the turn level by clinical experts. Using synthetic dialogues generated via a controlled two-agent setup, we train MindGuard, a family of lightweight safety classifiers (with 4B and 8B parameters). Our classifiers reduce false positives at high-recall operating points and, when paired with clinician language models, help achieve lower attack success and harmful engagement rates in adversarial multi-turn interactions compared to general-purpose safeguards. We release all models and human evaluation data.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_00950
institution	arXiv
publishDate	2026
record_format	arxiv
title	MindGuard: Guardrail Classifiers for Multi-Turn Mental Health Support
topic	Artificial Intelligence
url	https://arxiv.org/abs/2602.00950

Similar Items