author Wu, David
Haredasht, Fateme Nateghi
Maharaj, Saloni Kumar
Jain, Priyank
Tran, Jessica
Gwiazdon, Matthew
Rustagi, Arjun
Jindal, Jenelle
Koshy, Jacob M.
Kadiyala, Vinay
Agarwal, Anup
Tappuni, Bassman
French, Brianna
Jesudasen, Sirus
Cosgriff, Christopher V.
Chakraborty, Rebanta
Caldwell, Jillian
Ziolkowski, Susan
Iberri, David J.
Diep, Robert
Dalal, Rahul S.
Newman, Kira L.
Galetta, Kristin
Pallais, J. Carl
Wei, Nancy
Buchheit, Kathleen M.
Hong, David I.
Lee, Ernest Y.
Shih, Allen
Pahalyants, Vartan
Kaplan, Tamara B.
Ravi, Vishnu
Khemani, Sarita
Liang, April S.
Shirvani, Daniel
Patil, Advait
Marshall, Nicholas
Chopra, Kanav
Koh, Joel
Badhwar, Adi
McCoy, Liam G.
Wu, David J. H.
Weng, Yingjie
Ranji, Sumant
Schulman, Kevin
Shah, Nigam H.
Hom, Jason
Milstein, Arnold
Rodman, Adam
Chen, Jonathan H.
Goh, Ethan
contents Large language models (LLMs) are routinely used by physicians and patients for medical advice, yet their clinical safety profiles remain poorly characterized. We present NOHARM (Numerous Options Harm Assessment for Risk in Medicine), a benchmark using 100 real primary care-to-specialist consultation cases to measure frequency and severity of harm from LLM-generated medical recommendations. NOHARM covers 10 specialties, with 12,747 expert annotations for 4,249 clinical management options. Across 31 LLMs, potential for severe harm from LLM recommendations occurs in up to 22.2% (95% CI 21.6-22.8%) of cases, with harm of omission accounting for 76.6% (95% CI 76.4-76.8%) of errors. Safety performance is only moderately correlated (r = 0.61-0.64) with existing AI and medical knowledge benchmarks. The best models outperform generalist physicians on safety (mean difference 9.7%, 95% CI 7.0-12.5%), and a diverse multi-agent approach improves safety compared to solo models (mean difference 8.0%, 95% CI 4.0-12.1%). Therefore, despite strong performance on existing evaluations, widely used AI models can produce severely harmful medical advice at nontrivial rates, underscoring clinical safety as a distinct performance dimension necessitating explicit measurement.
format Preprint
id arxiv_https___arxiv_org_abs_2512_01241
institution arXiv
publishDate 2025
record_format arxiv
title First, do NOHARM: towards clinically safe large language models
topic Computers and Society
Artificial Intelligence
url https://arxiv.org/abs/2512.01241