Bibliographic Details
Main Authors: Yan, Weixiang; Liu, Haitian; Wu, Tengxiao; Chen, Qian; Wang, Wen; Chai, Haoyuan; Wang, Jiayi; Zhao, Weishan; Zhang, Yixin; Zhang, Renjun; Zhu, Li; Zhao, Xuandong
Format: Preprint
Published: 2024
Subjects: Computation and Language; Artificial Intelligence
Online Access: https://arxiv.org/abs/2406.13890
contents LLMs have achieved significant performance progress in various NLP applications. However, LLMs still struggle to meet the strict requirements for accuracy and reliability in the medical field and face many challenges in clinical applications. Existing clinical diagnostic evaluation benchmarks for evaluating medical agents powered by LLMs have severe limitations. Firstly, most existing medical evaluation benchmarks face the risk of data leakage or contamination. Secondly, existing benchmarks often neglect the characteristics of multiple departments and specializations in modern medical practice. Thirdly, existing evaluation methods are limited to multiple-choice questions, which do not align with real-world diagnostic scenarios. Lastly, existing evaluation methods lack comprehensive evaluations of end-to-end real clinical scenarios. These limitations in benchmarks in turn obstruct the advancement of LLMs and agents for medicine. To address these limitations, we introduce ClinicalLab, a comprehensive clinical diagnosis agent alignment suite. ClinicalLab includes ClinicalBench, an end-to-end multi-departmental clinical diagnostic evaluation benchmark for evaluating medical agents and LLMs. ClinicalBench is based on real cases that cover 24 departments and 150 diseases. ClinicalLab also includes four novel metrics (ClinicalMetrics) for evaluating the effectiveness of LLMs in clinical diagnostic tasks. We evaluate 17 LLMs and find that their performance varies significantly across different departments. Based on these findings, in ClinicalLab, we propose ClinicalAgent, an end-to-end clinical agent that aligns with real-world clinical diagnostic practices. We systematically investigate the performance and applicable scenarios of variants of ClinicalAgent on ClinicalBench. Our findings demonstrate the importance of aligning with modern medical practices in designing medical agents.
format Preprint
id arxiv_https___arxiv_org_abs_2406_13890
institution arXiv
publishDate 2024
record_format arxiv
title ClinicalLab: Aligning Agents for Multi-Departmental Clinical Diagnostics in the Real World
topic Computation and Language
Artificial Intelligence
url https://arxiv.org/abs/2406.13890