Staff View: CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs :: Library Catalog

Bibliographic Details
Main Authors:	Chen, Sijia; Li, Xiaomin; Zhang, Mengxue; Jiang, Eric Hanchen; Zeng, Qingcheng; Yu, Chen-Hsiang
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2505.11413

_version_	1866909613030375424
author	Chen, Sijia; Li, Xiaomin; Zhang, Mengxue; Jiang, Eric Hanchen; Zeng, Qingcheng; Yu, Chen-Hsiang
author_facet	Chen, Sijia; Li, Xiaomin; Zhang, Mengxue; Jiang, Eric Hanchen; Zeng, Qingcheng; Yu, Chen-Hsiang
contents	Large language models (LLMs) are increasingly deployed in medical contexts, raising critical concerns about safety, alignment, and susceptibility to adversarial manipulation. While prior benchmarks assess model refusal capabilities for harmful prompts, they often lack clinical specificity, graded harmfulness levels, and coverage of jailbreak-style attacks. We introduce CARES (Clinical Adversarial Robustness and Evaluation of Safety), a benchmark for evaluating LLM safety in healthcare. CARES includes over 18,000 prompts spanning eight medical safety principles, four harm levels, and four prompting styles: direct, indirect, obfuscated, and role-play, to simulate both malicious and benign use cases. We propose a three-way response evaluation protocol (Accept, Caution, Refuse) and a fine-grained Safety Score metric to assess model behavior. Our analysis reveals that many state-of-the-art LLMs remain vulnerable to jailbreaks that subtly rephrase harmful prompts, while also over-refusing safe but atypically phrased queries. Finally, we propose a mitigation strategy using a lightweight classifier to detect jailbreak attempts and steer models toward safer behavior via reminder-based conditioning. CARES provides a rigorous framework for testing and improving medical LLM safety under adversarial and ambiguous conditions.
format	Preprint
id	arxiv_https___arxiv_org_abs_2505_11413
institution	arXiv
publishDate	2025
record_format	arxiv
title	CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs
topic	Computation and Language
url	https://arxiv.org/abs/2505.11413

Similar Items