Saved in:
| Main authors: | Bedi, Suhana, et al. (complete list in the author field below) |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online access: | https://arxiv.org/abs/2505.23802 |
| _version_ | 1866909631663570944 |
|---|---|
| author | Bedi, Suhana; Cui, Hejie; Fuentes, Miguel; Unell, Alyssa; Wornow, Michael; Banda, Juan M.; Kotecha, Nikesh; Keyes, Timothy; Mai, Yifan; Oez, Mert; Qiu, Hao; Jain, Shrey; Schettini, Leonardo; Kashyap, Mehr; Fries, Jason Alan; Swaminathan, Akshay; Chung, Philip; Nateghi, Fateme; Aali, Asad; Nayak, Ashwin; Vedak, Shivam; Jain, Sneha S.; Patel, Birju; Fayanju, Oluseyi; Shah, Shreya; Goh, Ethan; Yao, Dong-han; Soetikno, Brian; Reis, Eduardo; Gatidis, Sergios; Divi, Vasu; Capasso, Robson; Saralkar, Rachna; Chiang, Chia-Chun; Jindal, Jenelle; Pham, Tho; Ghoddusi, Faraz; Lin, Steven; Chiou, Albert S.; Hong, Christy; Roy, Mohana; Gensheimer, Michael F.; Patel, Hinesh; Schulman, Kevin; Dash, Dev; Char, Danton; Downing, Lance; Grolleau, Francois; Black, Kameron; Mieso, Bethel; Zahedivash, Aydin; Yim, Wen-wai; Sharma, Harshita; Lee, Tony; Kirsch, Hannah; Lee, Jennifer; Ambers, Nerissa; Lugtu, Carlene; Sharma, Aditya; Mawji, Bilal; Alekseyev, Alex; Zhou, Vicky; Kakkar, Vikas; Helzer, Jarrod; Revri, Anurang; Bannett, Yair; Daneshjou, Roxana; Chen, Jonathan; Alsentzer, Emily; Morse, Keith; Ravi, Nirmal; Aghaeepour, Nima; Kennedy, Vanessa; Chaudhari, Akshay; Wang, Thomas; Koyejo, Sanmi; Lungren, Matthew P.; Horvitz, Eric; Liang, Percy; Pfeffer, Mike; Shah, Nigam H. |
| contents | While large language models (LLMs) achieve near-perfect scores on medical licensing exams, these evaluations inadequately reflect the complexity and diversity of real-world clinical practice. We introduce MedHELM, an extensible evaluation framework for assessing LLM performance for medical tasks with three key contributions. First, a clinician-validated taxonomy spanning 5 categories, 22 subcategories, and 121 tasks developed with 29 clinicians. Second, a comprehensive benchmark suite comprising 35 benchmarks (17 existing, 18 newly formulated) providing complete coverage of all categories and subcategories in the taxonomy. Third, a systematic comparison of LLMs with improved evaluation methods (using an LLM-jury) and a cost-performance analysis. Evaluation of 9 frontier LLMs, using the 35 benchmarks, revealed significant performance variation. Advanced reasoning models (DeepSeek R1: 66% win-rate; o3-mini: 64% win-rate) demonstrated superior performance, though Claude 3.5 Sonnet achieved comparable results at 40% lower estimated computational cost. On a normalized accuracy scale (0-1), most models performed strongly in Clinical Note Generation (0.73-0.85) and Patient Communication & Education (0.78-0.83), moderately in Medical Research Assistance (0.65-0.75), and generally lower in Clinical Decision Support (0.56-0.72) and Administration & Workflow (0.53-0.63). Our LLM-jury evaluation method achieved good agreement with clinician ratings (ICC = 0.47), surpassing both average clinician-clinician agreement (ICC = 0.43) and automated baselines including ROUGE-L (0.36) and BERTScore-F1 (0.44). These findings highlight the importance of real-world, task-specific evaluation for medical use of LLMs and provide an open-source framework to enable this. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2505_23802 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks |
| topic | Computation and Language; Artificial Intelligence |
| url | https://arxiv.org/abs/2505.23802 |