Bibliographic Details
Main authors: Bentley, Kate H., Belli, Luca, Chekroud, Adam M., Ward, Emily J., Dworkin, Emily R., Van Ark, Emily, Johnston, Kelly M., Alexander, Will, Brown, Millard, Hawrilenko, Matt
Format: Preprint
Published: 2026
Subjects:
Online access: https://arxiv.org/abs/2602.05088
_version_ 1866915803371143168
author Bentley, Kate H.
Belli, Luca
Chekroud, Adam M.
Ward, Emily J.
Dworkin, Emily R.
Van Ark, Emily
Johnston, Kelly M.
Alexander, Will
Brown, Millard
Hawrilenko, Matt
contents Millions now use generative AI chatbots for psychological support. Despite their promise of availability and scale, the single most pressing question in AI for mental health is whether these tools are safe. The Validation of Ethical and Responsible AI in Mental Health (VERA-MH) evaluation was recently proposed to meet the urgent need for an evidence-based, automated safety benchmark. This study aimed to examine the clinical validity and reliability of VERA-MH for evaluating AI safety in suicide risk detection and response. We first simulated a large set of conversations between large language model (LLM)-based users (user-agents) and general-purpose AI chatbots. Licensed mental health clinicians used a rubric (scoring guide) to independently rate the simulated conversations for safe and unsafe chatbot behaviors, as well as user-agent realism. An LLM-based judge used the same scoring rubric to evaluate the same set of simulated conversations. We then (a) examined rating alignment among individual clinicians, (b) examined rating alignment between clinician consensus and the LLM judge, and (c) summarized clinicians' ratings of user-agent realism. Individual clinicians were generally consistent with one another in their safety ratings (chance-corrected inter-rater reliability [IRR] = 0.77), establishing a gold-standard clinical reference. The LLM judge was strongly aligned with this clinical consensus overall (IRR = 0.81) and within key conditions. Together, findings from this human evaluation study support the validity and reliability of VERA-MH: an open-source, automated AI safety evaluation for mental health. Future research will examine the generalizability and robustness of VERA-MH and expand the framework to target additional key areas of AI safety in mental health.
format Preprint
id arxiv_https___arxiv_org_abs_2602_05088
institution arXiv
publishDate 2026
record_format arxiv
title VERA-MH: Reliability and Validity of an Open-Source AI Safety Evaluation in Mental Health
topic Artificial Intelligence
url https://arxiv.org/abs/2602.05088