Bibliographic Details
Main Authors: Fang, Jinrui; Chen, Runhan; Yang, Xu; Yu, Jian; Xu, Jiawei; Vinod, Ashwin; Shi, Wenqi; Chen, Tianlong; Ji, Heng; Zhai, ChengXiang; Ding, Ying; Zhang, Yuji
Format: Preprint
Published: 2026
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2604.04325
author Fang, Jinrui
Chen, Runhan
Yang, Xu
Yu, Jian
Xu, Jiawei
Vinod, Ashwin
Shi, Wenqi
Chen, Tianlong
Ji, Heng
Zhai, ChengXiang
Ding, Ying
Zhang, Yuji
contents Large language models (LLMs) achieve high accuracy in medical diagnosis when all clinical information is provided in a single turn, yet how they behave under multi-turn evidence accumulation, a setting closer to real clinical reasoning, remains unexplored. We introduce MINT (Medical Incremental N-Turn Benchmark), a high-fidelity, multi-turn medical diagnosis benchmark comprising 1,035 cases with clinically labeled evidence shards, controlled turn granularity, and information-preserving decomposition. Through systematic evaluation of 11 LLMs on MINT, we uncover three persistent behavioral patterns that significantly impact diagnostic decisions: (1) intent to answer: models rush to answer before sufficient evidence has been observed, with over 55% of answers committed within the first two turns; (2) self-correction: incorrect-to-correct answer revisions occur at up to 10.6 times the rate of correct-to-incorrect flips, revealing a latent capacity for self-correction that premature commitment forecloses; and (3) strong lures: clinically salient information such as laboratory results triggers premature answering even when models are explicitly instructed to wait. We translate these findings into clinically actionable guidance: deferring the diagnostic question to later turns reduces premature answering and improves accuracy at the first point of commitment by up to 62.6%, while reserving salient clinical evidence for later turns prevents a catastrophic accuracy drop of up to 23.3% caused by premature commitment. Our work provides both a controlled evaluation framework and concrete recommendations for improving the reliability of LLMs in multi-turn medical diagnosis.
format Preprint
id arxiv_https___arxiv_org_abs_2604_04325
institution arXiv
publishDate 2026
record_format arxiv
title Benchmarking Multi-turn Medical Diagnosis: Hold, Lure, and Self-Correction
topic Computation and Language
url https://arxiv.org/abs/2604.04325