Bibliographic Details
Main Authors: Ni, Congning, Qadir, Sarvech, Steitz, Bryan, Vaidya, Mihir Sachin, Song, Qingyuan, Xia, Lantian, Mulvaney, Shelagh, Liu, Siru, Ryu, Hyeyoung, Hecht, Leah, Bucher, Amy, Symons, Christopher, Novak, Laurie, Rose, Susannah L., Kantarcioglu, Murat, Malin, Bradley, Yin, Zhijun
Format: Preprint
Published: 2026
Online Access: https://arxiv.org/abs/2604.00014
Table of Contents:
  • Mental health concerns are often expressed outside clinical settings, including in high-distress help seeking, where safety-critical guidance may be needed. Consumer health informatics systems increasingly incorporate large language models (LLMs) for mental health question answering, yet many evaluations underrepresent narrative, high-distress inquiries. We introduce UTCO (User, Topic, Context, Tone), a prompt construction framework that represents an inquiry as four controllable elements for systematic stress testing. Using 2,075 UTCO-generated prompts, we evaluated Llama 3.3 and annotated hallucinations (fabricated or incorrect clinical content) and omissions (missing clinically necessary or safety-critical guidance). Hallucinations occurred in 6.5% of responses and omissions in 13.2%, with omissions concentrated in crisis and suicidal ideation prompts. Across regression, element-specific matching, and similarity-matched comparisons, failures were most consistently associated with context and tone, while user-background indicators showed no systematic differences after balancing. These findings support evaluating omissions as a primary safety outcome and moving beyond static benchmark question sets.