Staff View: A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains

Bibliographic Details
Main Authors:	Wang, Shirui; Tang, Zhihui; Yang, Huaxia; Gong, Qiuhong; Gu, Tiantian; Ma, Hongyang; Wang, Yongxin; Sun, Wubin; Lian, Zeliang; Mao, Kehang; Jiang, Yinan; Huang, Zhicheng; Ma, Lingyun; Shen, Wenjie; Ji, Yajie; Tan, Yunhui; Wang, Chunbo; Gao, Yunlu; Ye, Qianling; Lin, Rui; Chen, Mingyu; Niu, Lijuan; Wang, Zhihao; Yu, Peng; Lang, Mengran; Liu, Yue; Zhang, Huimin; Shen, Haitao; Chen, Long; Zhao, Qiguang; Liu, Si-Xuan; Zhou, Lina; Gao, Hua; Ye, Dongqiang; Meng, Lingmin; Yu, Youtao; Liang, Naixin; Wu, Jianxiong
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2507.23486

_version_	1866909735023804416
author	Wang, Shirui; Tang, Zhihui; Yang, Huaxia; Gong, Qiuhong; Gu, Tiantian; Ma, Hongyang; Wang, Yongxin; Sun, Wubin; Lian, Zeliang; Mao, Kehang; Jiang, Yinan; Huang, Zhicheng; Ma, Lingyun; Shen, Wenjie; Ji, Yajie; Tan, Yunhui; Wang, Chunbo; Gao, Yunlu; Ye, Qianling; Lin, Rui; Chen, Mingyu; Niu, Lijuan; Wang, Zhihao; Yu, Peng; Lang, Mengran; Liu, Yue; Zhang, Huimin; Shen, Haitao; Chen, Long; Zhao, Qiguang; Liu, Si-Xuan; Zhou, Lina; Gao, Hua; Ye, Dongqiang; Meng, Lingmin; Yu, Youtao; Liang, Naixin; Wu, Jianxiong
author_facet	Wang, Shirui; Tang, Zhihui; Yang, Huaxia; Gong, Qiuhong; Gu, Tiantian; Ma, Hongyang; Wang, Yongxin; Sun, Wubin; Lian, Zeliang; Mao, Kehang; Jiang, Yinan; Huang, Zhicheng; Ma, Lingyun; Shen, Wenjie; Ji, Yajie; Tan, Yunhui; Wang, Chunbo; Gao, Yunlu; Ye, Qianling; Lin, Rui; Chen, Mingyu; Niu, Lijuan; Wang, Zhihao; Yu, Peng; Lang, Mengran; Liu, Yue; Zhang, Huimin; Shen, Haitao; Chen, Long; Zhao, Qiguang; Liu, Si-Xuan; Zhou, Lina; Gao, Hua; Ye, Dongqiang; Meng, Lingmin; Yu, Youtao; Liang, Naixin; Wu, Jianxiong
contents	Large language models (LLMs) hold promise in clinical decision support but face major challenges in safety evaluation and effectiveness validation. We developed the Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB), a multidimensional framework built on clinical expert consensus, encompassing 30 criteria covering critical areas like critical illness recognition, guideline adherence, and medication safety, with weighted consequence measures. Thirty-two specialist physicians developed and reviewed 2,069 open-ended Q&A items aligned with these criteria, spanning 26 clinical departments to simulate real-world scenarios. Benchmark testing of six LLMs revealed moderate overall performance (average total score 57.2%, safety 54.7%, effectiveness 62.3%), with a significant 13.3% performance drop in high-risk scenarios (p < 0.0001). Domain-specific medical LLMs showed consistent performance advantages over general-purpose models, with relatively higher top scores in safety (0.912) and effectiveness (0.861). The findings of this study not only provide a standardized metric for evaluating the clinical application of medical LLMs, facilitating comparative analyses, risk exposure identification, and improvement directions across different scenarios, but also hold the potential to promote safer and more effective deployment of large language models in healthcare environments.
format	Preprint
id	arxiv_https___arxiv_org_abs_2507_23486
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains Wang, Shirui Tang, Zhihui Yang, Huaxia Gong, Qiuhong Gu, Tiantian Ma, Hongyang Wang, Yongxin Sun, Wubin Lian, Zeliang Mao, Kehang Jiang, Yinan Huang, Zhicheng Ma, Lingyun Shen, Wenjie Ji, Yajie Tan, Yunhui Wang, Chunbo Gao, Yunlu Ye, Qianling Lin, Rui Chen, Mingyu Niu, Lijuan Wang, Zhihao Yu, Peng Lang, Mengran Liu, Yue Zhang, Huimin Shen, Haitao Chen, Long Zhao, Qiguang Liu, Si-Xuan Zhou, Lina Gao, Hua Ye, Dongqiang Meng, Lingmin Yu, Youtao Liang, Naixin Wu, Jianxiong Computation and Language Large language models (LLMs) hold promise in clinical decision support but face major challenges in safety evaluation and effectiveness validation. We developed the Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB), a multidimensional framework built on clinical expert consensus, encompassing 30 criteria covering critical areas like critical illness recognition, guideline adherence, and medication safety, with weighted consequence measures. Thirty-two specialist physicians developed and reviewed 2,069 open-ended Q&A items aligned with these criteria, spanning 26 clinical departments to simulate real-world scenarios. Benchmark testing of six LLMs revealed moderate overall performance (average total score 57.2%, safety 54.7%, effectiveness 62.3%), with a significant 13.3% performance drop in high-risk scenarios (p < 0.0001). Domain-specific medical LLMs showed consistent performance advantages over general-purpose models, with relatively higher top scores in safety (0.912) and effectiveness (0.861). The findings of this study not only provide a standardized metric for evaluating the clinical application of medical LLMs, facilitating comparative analyses, risk exposure identification, and improvement directions across different scenarios, but also hold the potential to promote safer and more effective deployment of large language models in healthcare environments.
title	A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains
topic	Computation and Language
url	https://arxiv.org/abs/2507.23486
