author Wang, Shirui
Tang, Zhihui
Yang, Huaxia
Gong, Qiuhong
Gu, Tiantian
Ma, Hongyang
Wang, Yongxin
Sun, Wubin
Lian, Zeliang
Mao, Kehang
Jiang, Yinan
Huang, Zhicheng
Ma, Lingyun
Shen, Wenjie
Ji, Yajie
Tan, Yunhui
Wang, Chunbo
Gao, Yunlu
Ye, Qianling
Lin, Rui
Chen, Mingyu
Niu, Lijuan
Wang, Zhihao
Yu, Peng
Lang, Mengran
Liu, Yue
Zhang, Huimin
Shen, Haitao
Chen, Long
Zhao, Qiguang
Liu, Si-Xuan
Zhou, Lina
Gao, Hua
Ye, Dongqiang
Meng, Lingmin
Yu, Youtao
Liang, Naixin
Wu, Jianxiong
contents Large language models (LLMs) hold promise in clinical decision support but face major challenges in safety evaluation and effectiveness validation. We developed the Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB), a multidimensional framework built on clinical expert consensus. It comprises 30 criteria covering key areas such as critical illness recognition, guideline adherence, and medication safety, with weighted consequence measures. Thirty-two specialist physicians developed and reviewed 2,069 open-ended Q&A items aligned with these criteria, spanning 26 clinical departments to simulate real-world scenarios. Benchmark testing of six LLMs revealed moderate overall performance (average total score 57.2%, safety 54.7%, effectiveness 62.3%), with a significant 13.3% performance drop in high-risk scenarios (p < 0.0001). Domain-specific medical LLMs showed consistent performance advantages over general-purpose models, with relatively higher top scores in safety (0.912) and effectiveness (0.861). These findings provide a standardized metric for evaluating the clinical application of medical LLMs, supporting comparative analyses, risk exposure identification, and the selection of improvement directions across different scenarios. They may also promote safer and more effective deployment of LLMs in healthcare environments.
format Preprint
id arxiv_https___arxiv_org_abs_2507_23486
institution arXiv
publishDate 2025
record_format arxiv
title A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains
topic Computation and Language
url https://arxiv.org/abs/2507.23486