Bibliographic Details
Main Authors: Zhao, Junjie, Liang, Jingyi, Cai, Zhenyang, Zhang, Jiaming, Wen, Zhenwei, Deng, Shuzhi, Yi, Wenjing, Luo, Chunfeng, Zhang, Hexian, Chen, Junying, Liu, Tianrui, Bai, Zhuhui, Zhang, Zixu, Singh, Pradeep, Liu, Xiang, Li, Jianquan, Tran, Nhan L, Schwendicke, Falk, Jin, Zuolin, Jin, Lijian, Chen, Liangyi, Yang, Wei-fa, Wang, Benyou, Wang, Junwen, Jiang, Shan
Format: Preprint
Published: 2026
Subjects: Artificial Intelligence, Computation and Language
Online Access: https://arxiv.org/abs/2605.24636
author Zhao, Junjie
Liang, Jingyi
Cai, Zhenyang
Zhang, Jiaming
Wen, Zhenwei
Deng, Shuzhi
Yi, Wenjing
Luo, Chunfeng
Zhang, Hexian
Chen, Junying
Liu, Tianrui
Bai, Zhuhui
Zhang, Zixu
Singh, Pradeep
Liu, Xiang
Li, Jianquan
Tran, Nhan L
Schwendicke, Falk
Jin, Zuolin
Jin, Lijian
Chen, Liangyi
Yang, Wei-fa
Wang, Benyou
Wang, Junwen
Jiang, Shan
contents While large language models (LLMs) hold transformative potential for medicine, their reasoning robustness and safety in real-world clinical scenarios remain critically underexplored, particularly in dentistry. Here we introduce GlobalDentBench, the first multinational dental benchmark, featuring a taxonomy that encompasses 14 dental specialties across 88 countries and regions spanning six continents. The benchmark comprises 8,978 expert-validated questions across three formats (multiple-choice, short-answer, and case-based questions) and assesses three progressive reasoning levels: knowledge recall (L1), routine reasoning (L2), and individualized reasoning (L3). To ensure data quality, the automated construction framework was calibrated by six senior dentists, achieving expert agreement rates of 99.98% for multiple-choice and short-answer questions and 96.78% for the more complex case-based questions. Evaluation of 12 frontier LLMs on GlobalDentBench revealed a sharp, stepwise performance degradation with increasing reasoning complexity. Specifically, accuracy plummeted from 81.34% on multiple-choice to 64.53% on short-answer and 22.34% on case-based questions, while declining markedly from 74.01% at L1 to 55.64% at L2 and 35.71% at L3. More critically, risk analysis of real-world dental cases demonstrated an alarming overall unsafe rate of 31.01% in LLM-generated clinical recommendations, with 4.51% posing risks of irreversible patient harm and risks particularly pronounced in specialties such as orthodontics. These findings expose fundamental limitations in the medical reasoning and safety of current LLMs. Consequently, GlobalDentBench provides a scalable foundation for trustworthy clinical AI evaluation, underscoring the urgent need for rigorous validation before the safe deployment of these models in healthcare.
format Preprint
id arxiv_https___arxiv_org_abs_2605_24636
institution arXiv
publishDate 2026
record_format arxiv
title GlobalDentBench: A Multinational Benchmark for Evaluating LLM Clinical Reasoning in Dentistry with Expert Calibration
topic Artificial Intelligence
Computation and Language
url https://arxiv.org/abs/2605.24636