MARC View :: Library Catalog

Bibliographic Details
Main Authors:	Hoang, Nguyen Khoi; Mehri, Shuhaib; Hsu, Tse-An; Sun, Yi-Jyun; Truong, Quynh Xuan Nguyen; Doan, Khoa D; Hakkani-Tür, Dilek
Format:	Preprint
Published:	2026
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2604.25840

_version_	1866914514562187264
author	Hoang, Nguyen Khoi; Mehri, Shuhaib; Hsu, Tse-An; Sun, Yi-Jyun; Truong, Quynh Xuan Nguyen; Doan, Khoa D; Hakkani-Tür, Dilek
author_facet	Hoang, Nguyen Khoi; Mehri, Shuhaib; Hsu, Tse-An; Sun, Yi-Jyun; Truong, Quynh Xuan Nguyen; Doan, Khoa D; Hakkani-Tür, Dilek
contents	Patient simulators are gaining traction in mental health training by providing scalable exposure to complex and sensitive patient interactions. Simulating depressed patients is particularly challenging, as safety constraints and high patient variability complicate simulations and underscore the need for simulators that capture diverse and realistic patient behaviors. However, existing evaluations heavily rely on LLM-judges with poorly specified prompts and do not assess behavioral diversity. We introduce PSI-Bench, an automatic evaluation framework that provides interpretable, clinically grounded diagnostics of depression patient simulator behavior across turn-, dialogue-, and population-level dimensions. Using PSI-Bench, we benchmark seven LLMs across two simulator frameworks and find that simulators produce overly long, lexically diverse responses, show reduced variability, resolve emotions too quickly, and follow a uniform negative-to-positive trajectory. We also show that the simulation framework has a larger impact on fidelity than the model scale. Results from a human study demonstrate that our benchmark is strongly aligned with expert judgments. Our work reveals key limitations of current depression patient simulators and provides an interpretable, extensible benchmark to guide future simulator design and evaluation.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_25840
institution	arXiv
publishDate	2026
record_format	arxiv
title	PSI-Bench: Towards Clinically Grounded and Interpretable Evaluation of Depression Patient Simulators
topic	Computation and Language; Artificial Intelligence
url	https://arxiv.org/abs/2604.25840
