Staff View: MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks :: Library Catalog


Bibliographic Details
Main Authors:	Bedi, Suhana; Cui, Hejie; Fuentes, Miguel; Unell, Alyssa; Wornow, Michael; Banda, Juan M.; Kotecha, Nikesh; Keyes, Timothy; Mai, Yifan; Oez, Mert; Qiu, Hao; Jain, Shrey; Schettini, Leonardo; Kashyap, Mehr; Fries, Jason Alan; Swaminathan, Akshay; Chung, Philip; Nateghi, Fateme; Aali, Asad; Nayak, Ashwin; Vedak, Shivam; Jain, Sneha S.; Patel, Birju; Fayanju, Oluseyi; Shah, Shreya; Goh, Ethan; Yao, Dong-han; Soetikno, Brian; Reis, Eduardo; Gatidis, Sergios; Divi, Vasu; Capasso, Robson; Saralkar, Rachna; Chiang, Chia-Chun; Jindal, Jenelle; Pham, Tho; Ghoddusi, Faraz; Lin, Steven; Chiou, Albert S.; Hong, Christy; Roy, Mohana; Gensheimer, Michael F.; Patel, Hinesh; Schulman, Kevin; Dash, Dev; Char, Danton; Downing, Lance; Grolleau, Francois; Black, Kameron; Mieso, Bethel; Zahedivash, Aydin; Yim, Wen-wai; Sharma, Harshita; Lee, Tony; Kirsch, Hannah; Lee, Jennifer; Ambers, Nerissa; Lugtu, Carlene; Sharma, Aditya; Mawji, Bilal; Alekseyev, Alex; Zhou, Vicky; Kakkar, Vikas; Helzer, Jarrod; Revri, Anurang; Bannett, Yair; Daneshjou, Roxana; Chen, Jonathan; Alsentzer, Emily; Morse, Keith; Ravi, Nirmal; Aghaeepour, Nima; Kennedy, Vanessa; Chaudhari, Akshay; Wang, Thomas; Koyejo, Sanmi; Lungren, Matthew P.; Horvitz, Eric; Liang, Percy; Pfeffer, Mike; Shah, Nigam H.
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2505.23802

_version_	1866909631663570944
author	Bedi, Suhana; Cui, Hejie; Fuentes, Miguel; Unell, Alyssa; Wornow, Michael; Banda, Juan M.; Kotecha, Nikesh; Keyes, Timothy; Mai, Yifan; Oez, Mert; Qiu, Hao; Jain, Shrey; Schettini, Leonardo; Kashyap, Mehr; Fries, Jason Alan; Swaminathan, Akshay; Chung, Philip; Nateghi, Fateme; Aali, Asad; Nayak, Ashwin; Vedak, Shivam; Jain, Sneha S.; Patel, Birju; Fayanju, Oluseyi; Shah, Shreya; Goh, Ethan; Yao, Dong-han; Soetikno, Brian; Reis, Eduardo; Gatidis, Sergios; Divi, Vasu; Capasso, Robson; Saralkar, Rachna; Chiang, Chia-Chun; Jindal, Jenelle; Pham, Tho; Ghoddusi, Faraz; Lin, Steven; Chiou, Albert S.; Hong, Christy; Roy, Mohana; Gensheimer, Michael F.; Patel, Hinesh; Schulman, Kevin; Dash, Dev; Char, Danton; Downing, Lance; Grolleau, Francois; Black, Kameron; Mieso, Bethel; Zahedivash, Aydin; Yim, Wen-wai; Sharma, Harshita; Lee, Tony; Kirsch, Hannah; Lee, Jennifer; Ambers, Nerissa; Lugtu, Carlene; Sharma, Aditya; Mawji, Bilal; Alekseyev, Alex; Zhou, Vicky; Kakkar, Vikas; Helzer, Jarrod; Revri, Anurang; Bannett, Yair; Daneshjou, Roxana; Chen, Jonathan; Alsentzer, Emily; Morse, Keith; Ravi, Nirmal; Aghaeepour, Nima; Kennedy, Vanessa; Chaudhari, Akshay; Wang, Thomas; Koyejo, Sanmi; Lungren, Matthew P.; Horvitz, Eric; Liang, Percy; Pfeffer, Mike; Shah, Nigam H.
author_facet	Bedi, Suhana; Cui, Hejie; Fuentes, Miguel; Unell, Alyssa; Wornow, Michael; Banda, Juan M.; Kotecha, Nikesh; Keyes, Timothy; Mai, Yifan; Oez, Mert; Qiu, Hao; Jain, Shrey; Schettini, Leonardo; Kashyap, Mehr; Fries, Jason Alan; Swaminathan, Akshay; Chung, Philip; Nateghi, Fateme; Aali, Asad; Nayak, Ashwin; Vedak, Shivam; Jain, Sneha S.; Patel, Birju; Fayanju, Oluseyi; Shah, Shreya; Goh, Ethan; Yao, Dong-han; Soetikno, Brian; Reis, Eduardo; Gatidis, Sergios; Divi, Vasu; Capasso, Robson; Saralkar, Rachna; Chiang, Chia-Chun; Jindal, Jenelle; Pham, Tho; Ghoddusi, Faraz; Lin, Steven; Chiou, Albert S.; Hong, Christy; Roy, Mohana; Gensheimer, Michael F.; Patel, Hinesh; Schulman, Kevin; Dash, Dev; Char, Danton; Downing, Lance; Grolleau, Francois; Black, Kameron; Mieso, Bethel; Zahedivash, Aydin; Yim, Wen-wai; Sharma, Harshita; Lee, Tony; Kirsch, Hannah; Lee, Jennifer; Ambers, Nerissa; Lugtu, Carlene; Sharma, Aditya; Mawji, Bilal; Alekseyev, Alex; Zhou, Vicky; Kakkar, Vikas; Helzer, Jarrod; Revri, Anurang; Bannett, Yair; Daneshjou, Roxana; Chen, Jonathan; Alsentzer, Emily; Morse, Keith; Ravi, Nirmal; Aghaeepour, Nima; Kennedy, Vanessa; Chaudhari, Akshay; Wang, Thomas; Koyejo, Sanmi; Lungren, Matthew P.; Horvitz, Eric; Liang, Percy; Pfeffer, Mike; Shah, Nigam H.
contents	While large language models (LLMs) achieve near-perfect scores on medical licensing exams, these evaluations inadequately reflect the complexity and diversity of real-world clinical practice. We introduce MedHELM, an extensible evaluation framework for assessing LLM performance for medical tasks with three key contributions. First, a clinician-validated taxonomy spanning 5 categories, 22 subcategories, and 121 tasks developed with 29 clinicians. Second, a comprehensive benchmark suite comprising 35 benchmarks (17 existing, 18 newly formulated) providing complete coverage of all categories and subcategories in the taxonomy. Third, a systematic comparison of LLMs with improved evaluation methods (using an LLM-jury) and a cost-performance analysis. Evaluation of 9 frontier LLMs, using the 35 benchmarks, revealed significant performance variation. Advanced reasoning models (DeepSeek R1: 66% win-rate; o3-mini: 64% win-rate) demonstrated superior performance, though Claude 3.5 Sonnet achieved comparable results at 40% lower estimated computational cost. On a normalized accuracy scale (0-1), most models performed strongly in Clinical Note Generation (0.73-0.85) and Patient Communication & Education (0.78-0.83), moderately in Medical Research Assistance (0.65-0.75), and generally lower in Clinical Decision Support (0.56-0.72) and Administration & Workflow (0.53-0.63). Our LLM-jury evaluation method achieved good agreement with clinician ratings (ICC = 0.47), surpassing both average clinician-clinician agreement (ICC = 0.43) and automated baselines including ROUGE-L (0.36) and BERTScore-F1 (0.44). Claude 3.5 Sonnet achieved comparable performance to top models at lower estimated cost. These findings highlight the importance of real-world, task-specific evaluation for medical use of LLMs and provide an open source framework to enable this.
format	Preprint
id	arxiv_https___arxiv_org_abs_2505_23802
institution	arXiv
publishDate	2025
record_format	arxiv
title	MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks
topic	Computation and Language; Artificial Intelligence
url	https://arxiv.org/abs/2505.23802
