Bibliographic Details
Main Authors:	Ni, Congning; Qadir, Sarvech; Steitz, Bryan; Vaidya, Mihir Sachin; Song, Qingyuan; Xia, Lantian; Mulvaney, Shelagh; Liu, Siru; Ryu, Hyeyoung; Hecht, Leah; Bucher, Amy; Symons, Christopher; Novak, Laurie; Rose, Susannah L.; Kantarcioglu, Murat; Malin, Bradley; Yin, Zhijun
Format:	Preprint
Published:	2026
Subjects:	Computation and Language; Human-Computer Interaction
Online Access:	https://arxiv.org/abs/2604.00014

Table of Contents:

Mental health concerns are often expressed outside clinical settings, including in high-distress help seeking, where safety-critical guidance may be needed. Consumer health informatics systems increasingly incorporate large language models (LLMs) for mental health question answering, yet many evaluations underrepresent narrative, high-distress inquiries. We introduce UTCO (User, Topic, Context, Tone), a prompt construction framework that represents an inquiry as four controllable elements for systematic stress testing. Using 2,075 UTCO-generated prompts, we evaluated Llama 3.3 and annotated hallucinations (fabricated or incorrect clinical content) and omissions (missing clinically necessary or safety-critical guidance). Hallucinations occurred in 6.5% of responses and omissions in 13.2%, with omissions concentrated in crisis and suicidal ideation prompts. Across regression, element-specific matching, and similarity-matched comparisons, failures were most consistently associated with context and tone, while user-background indicators showed no systematic differences after balancing. These findings support evaluating omissions as a primary safety outcome and moving beyond static benchmark question sets.
