Staff View: :: Library Catalog

Bibliographic Details
Main Authors:	Nakao, Mahiro; Takemoto, Kazuhiro
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence; Computers and Society; Robotics
Online Access:	https://arxiv.org/abs/2604.26577

_version_	1866915967619039232
author	Nakao, Mahiro Takemoto, Kazuhiro
author_facet	Nakao, Mahiro Takemoto, Kazuhiro
contents	Large language models (LLMs) are increasingly considered for deployment as the control component of robotic health attendants, yet their safety in this context remains poorly characterized. We introduce a dataset of 270 harmful instructions spanning nine prohibited behavior categories grounded in the American Medical Association Principles of Medical Ethics, and use it to evaluate 72 LLMs in a simulation environment based on the Robotic Health Attendant framework. The mean violation rate across all models was 54.4%, with more than half exceeding 50%, and violation rates varied substantially across behavior categories, with superficially plausible instructions such as device manipulation and emergency delay proving harder to refuse than overtly destructive ones. Model size and release date were the primary determinants of safety performance among open-weight models, and proprietary models were substantially safer than open-weight counterparts (median 23.7% versus 72.8%). Medical domain fine-tuning conferred no significant overall safety benefit, and a prompt-based defense strategy produced only a modest reduction in violation rates among the least safe models, leaving absolute violation rates at levels that would preclude safe clinical deployment. These findings demonstrate that safety evaluation must be treated as a first-class criterion in the development and deployment of LLMs for robotic health attendants.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_26577
institution	arXiv
publishDate	2026
record_format	arxiv
spellingShingle	Benchmarking the Safety of Large Language Models for Robotic Health Attendant Control Nakao, Mahiro Takemoto, Kazuhiro Artificial Intelligence Computers and Society Robotics Large language models (LLMs) are increasingly considered for deployment as the control component of robotic health attendants, yet their safety in this context remains poorly characterized. We introduce a dataset of 270 harmful instructions spanning nine prohibited behavior categories grounded in the American Medical Association Principles of Medical Ethics, and use it to evaluate 72 LLMs in a simulation environment based on the Robotic Health Attendant framework. The mean violation rate across all models was 54.4%, with more than half exceeding 50%, and violation rates varied substantially across behavior categories, with superficially plausible instructions such as device manipulation and emergency delay proving harder to refuse than overtly destructive ones. Model size and release date were the primary determinants of safety performance among open-weight models, and proprietary models were substantially safer than open-weight counterparts (median 23.7% versus 72.8%). Medical domain fine-tuning conferred no significant overall safety benefit, and a prompt-based defense strategy produced only a modest reduction in violation rates among the least safe models, leaving absolute violation rates at levels that would preclude safe clinical deployment. These findings demonstrate that safety evaluation must be treated as a first-class criterion in the development and deployment of LLMs for robotic health attendants.
title	Benchmarking the Safety of Large Language Models for Robotic Health Attendant Control
topic	Artificial Intelligence Computers and Society Robotics
url	https://arxiv.org/abs/2604.26577

Similar Items