| Main Author: | |
|---|---|
| Format: | Digital resource |
| Published: | Zenodo, 2026 |
| Online Access: | https://doi.org/10.5281/zenodo.19600308 |
Table of Contents:
- <p class="western"><span>Large language models (LLMs) are increasingly considered for clinical applications, yet their reliability remains contested. A widely cited concern is that LLMs lack the metacognitive capacity required for safe clinical reasoning, based largely on performance in constrained benchmark evaluations. We tested this claim under ecologically valid conditions using an interactive clinical simulation designed to reflect real-world decision-making.</span></p> <p class="western"><span>Six state-of-the-art LLMs were evaluated across fifteen cases spanning acute emergencies, geriatric complexity, psychiatric presentations, and end-of-life care. The simulation required models to operate under incomplete and potentially unreliable information, manage competing clinical and ethical demands, provide psychological support to staff, and adapt reasoning to situational constraints. Performance was assessed using predefined, preregistered criteria capturing case resolution, critical safety demands, and standard clinical judgment.</span></p> <p class="western"><span>All six models met all predefined performance thresholds. No model failed any critical safety demand, and five of six resolved all cases. Across models, standard clinical performance ranged from 85% to 99%. In a detailed case analysis, models demonstrated both targeted information-seeking and direct inference of latent clinical context, consistent with metacognitive reasoning under uncertainty.</span></p> <p class="western"><span>These findings provide direct counterexamples to the claim that LLMs lack clinically relevant metacognition. They indicate that this capacity is present and can be expressed under conditions that permit it. More broadly, the results suggest that commonly used evaluation formats may underestimate LLM capabilities by measuring properties of the evaluation design rather than properties of the systems themselves.</span></p>