Staff View: :: Library Catalog

Bibliographic Details
Main Authors:	Bentley, Kate H.; Belli, Luca; Chekroud, Adam M.; Ward, Emily J.; Dworkin, Emily R.; Van Ark, Emily; Johnston, Kelly M.; Alexander, Will; Brown, Millard; Hawrilenko, Matt
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2602.05088

_version_	1866915803371143168
author	Bentley, Kate H.; Belli, Luca; Chekroud, Adam M.; Ward, Emily J.; Dworkin, Emily R.; Van Ark, Emily; Johnston, Kelly M.; Alexander, Will; Brown, Millard; Hawrilenko, Matt
author_facet	Bentley, Kate H.; Belli, Luca; Chekroud, Adam M.; Ward, Emily J.; Dworkin, Emily R.; Van Ark, Emily; Johnston, Kelly M.; Alexander, Will; Brown, Millard; Hawrilenko, Matt
contents	Millions now use generative AI chatbots for psychological support. Despite the promise related to availability and scale, the single most pressing question in AI for mental health is whether these tools are safe. The Validation of Ethical and Responsible AI in Mental Health (VERA-MH) evaluation was recently proposed to meet the urgent need for an evidence-based, automated safety benchmark. This study aimed to examine the clinical validity and reliability of VERA-MH for evaluating AI safety in suicide risk detection and response. We first simulated a large set of conversations between large language model (LLM)-based users (user-agents) and general-purpose AI chatbots. Licensed mental health clinicians used a rubric (scoring guide) to independently rate the simulated conversations for safe and unsafe chatbot behaviors, as well as user-agent realism. An LLM-based judge used the same scoring rubric to evaluate the same set of simulated conversations. We then examined rating alignment (a) among individual clinicians and (b) between clinician consensus and the LLM judge, and (c) summarized clinicians' ratings of user-agent realism. Individual clinicians were generally consistent with one another in their safety ratings (chance-corrected inter-rater reliability [IRR] = 0.77), establishing a gold-standard clinical reference. The LLM judge was strongly aligned with this clinical consensus overall (IRR = 0.81) and within key conditions. Together, findings from this human evaluation study support the validity and reliability of VERA-MH: an open-source, automated AI safety evaluation for mental health. Future research will examine the generalizability and robustness of VERA-MH and expand the framework to target additional key areas of AI safety in mental health.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_05088
institution	arXiv
publishDate	2026
record_format	arxiv
title	VERA-MH: Reliability and Validity of an Open-Source AI Safety Evaluation in Mental Health
topic	Artificial Intelligence
url	https://arxiv.org/abs/2602.05088
