Bibliographic Details
Main Authors: Bedi, Suhana; Cui, Hejie; Fuentes, Miguel; Unell, Alyssa; Wornow, Michael; Banda, Juan M.; Kotecha, Nikesh; Keyes, Timothy; Mai, Yifan; Oez, Mert; Qiu, Hao; Jain, Shrey; Schettini, Leonardo; Kashyap, Mehr; Fries, Jason Alan; Swaminathan, Akshay; Chung, Philip; Nateghi, Fateme; Aali, Asad; Nayak, Ashwin; Vedak, Shivam; Jain, Sneha S.; Patel, Birju; Fayanju, Oluseyi; Shah, Shreya; Goh, Ethan; Yao, Dong-han; Soetikno, Brian; Reis, Eduardo; Gatidis, Sergios; Divi, Vasu; Capasso, Robson; Saralkar, Rachna; Chiang, Chia-Chun; Jindal, Jenelle; Pham, Tho; Ghoddusi, Faraz; Lin, Steven; Chiou, Albert S.; Hong, Christy; Roy, Mohana; Gensheimer, Michael F.; Patel, Hinesh; Schulman, Kevin; Dash, Dev; Char, Danton; Downing, Lance; Grolleau, Francois; Black, Kameron; Mieso, Bethel; Zahedivash, Aydin; Yim, Wen-wai; Sharma, Harshita; Lee, Tony; Kirsch, Hannah; Lee, Jennifer; Ambers, Nerissa; Lugtu, Carlene; Sharma, Aditya; Mawji, Bilal; Alekseyev, Alex; Zhou, Vicky; Kakkar, Vikas; Helzer, Jarrod; Revri, Anurang; Bannett, Yair; Daneshjou, Roxana; Chen, Jonathan; Alsentzer, Emily; Morse, Keith; Ravi, Nirmal; Aghaeepour, Nima; Kennedy, Vanessa; Chaudhari, Akshay; Wang, Thomas; Koyejo, Sanmi; Lungren, Matthew P.; Horvitz, Eric; Liang, Percy; Pfeffer, Mike; Shah, Nigam H.
Format: Preprint
Published: 2025
Subjects:
Online Access: https://arxiv.org/abs/2505.23802
_version_ 1866909631663570944
author Bedi, Suhana
Cui, Hejie
Fuentes, Miguel
Unell, Alyssa
Wornow, Michael
Banda, Juan M.
Kotecha, Nikesh
Keyes, Timothy
Mai, Yifan
Oez, Mert
Qiu, Hao
Jain, Shrey
Schettini, Leonardo
Kashyap, Mehr
Fries, Jason Alan
Swaminathan, Akshay
Chung, Philip
Nateghi, Fateme
Aali, Asad
Nayak, Ashwin
Vedak, Shivam
Jain, Sneha S.
Patel, Birju
Fayanju, Oluseyi
Shah, Shreya
Goh, Ethan
Yao, Dong-han
Soetikno, Brian
Reis, Eduardo
Gatidis, Sergios
Divi, Vasu
Capasso, Robson
Saralkar, Rachna
Chiang, Chia-Chun
Jindal, Jenelle
Pham, Tho
Ghoddusi, Faraz
Lin, Steven
Chiou, Albert S.
Hong, Christy
Roy, Mohana
Gensheimer, Michael F.
Patel, Hinesh
Schulman, Kevin
Dash, Dev
Char, Danton
Downing, Lance
Grolleau, Francois
Black, Kameron
Mieso, Bethel
Zahedivash, Aydin
Yim, Wen-wai
Sharma, Harshita
Lee, Tony
Kirsch, Hannah
Lee, Jennifer
Ambers, Nerissa
Lugtu, Carlene
Sharma, Aditya
Mawji, Bilal
Alekseyev, Alex
Zhou, Vicky
Kakkar, Vikas
Helzer, Jarrod
Revri, Anurang
Bannett, Yair
Daneshjou, Roxana
Chen, Jonathan
Alsentzer, Emily
Morse, Keith
Ravi, Nirmal
Aghaeepour, Nima
Kennedy, Vanessa
Chaudhari, Akshay
Wang, Thomas
Koyejo, Sanmi
Lungren, Matthew P.
Horvitz, Eric
Liang, Percy
Pfeffer, Mike
Shah, Nigam H.
contents While large language models (LLMs) achieve near-perfect scores on medical licensing exams, these evaluations inadequately reflect the complexity and diversity of real-world clinical practice. We introduce MedHELM, an extensible evaluation framework for assessing LLM performance on medical tasks, with three key contributions. First, a clinician-validated taxonomy spanning 5 categories, 22 subcategories, and 121 tasks, developed with 29 clinicians. Second, a comprehensive benchmark suite comprising 35 benchmarks (17 existing, 18 newly formulated) that provides complete coverage of all categories and subcategories in the taxonomy. Third, a systematic comparison of LLMs with improved evaluation methods (using an LLM-jury) and a cost-performance analysis. Evaluation of 9 frontier LLMs on the 35 benchmarks revealed significant performance variation. Advanced reasoning models (DeepSeek R1: 66% win-rate; o3-mini: 64% win-rate) demonstrated superior performance, though Claude 3.5 Sonnet achieved comparable results at 40% lower estimated computational cost. On a normalized accuracy scale (0-1), most models performed strongly in Clinical Note Generation (0.73-0.85) and Patient Communication & Education (0.78-0.83), moderately in Medical Research Assistance (0.65-0.75), and generally lower in Clinical Decision Support (0.56-0.72) and Administration & Workflow (0.53-0.63). Our LLM-jury evaluation method achieved good agreement with clinician ratings (ICC = 0.47), surpassing both average clinician-clinician agreement (ICC = 0.43) and automated baselines including ROUGE-L (0.36) and BERTScore-F1 (0.44). These findings highlight the importance of real-world, task-specific evaluation for medical use of LLMs, and MedHELM provides an open-source framework to enable it.
format Preprint
id arxiv_https___arxiv_org_abs_2505_23802
institution arXiv
publishDate 2025
record_format arxiv
title MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks
topic Computation and Language
Artificial Intelligence
url https://arxiv.org/abs/2505.23802