| Main author: | |
|---|---|
| Format: | Digital resource |
| Language: | English |
| Published: | Zenodo, 2026 |
| Subjects: | |
| Online access: | https://doi.org/10.5281/zenodo.20067056 |
Contents:
- <p class="MsoNormal"><strong>MEMB (Maternal Emergency Metacognition Benchmark)</strong> is a two-task evaluation framework that measures metacognition, here meaning a model's ability to calibrate its own confidence, in large language models applied to maternal emergency diagnostics. Task 1 evaluates diagnostic accuracy jointly with confidence calibration across 130 clinical cases spanning rare and common obstetric emergencies, using a composite score weighted 70% toward the Brier score and 30% toward Expected Calibration Error (ECE). Task 2 evaluates sycophancy resistance, that is, whether models maintain clinical accuracy when confronted with confident but incorrect human assertions, across 30 adversarial cases using eight purpose-designed metrics, including Trust Alignment Score (TAS), Sycophancy Rate (SR), False Defiance Rate (FDR), and Confidence Gap (CG).</p>
<p class="MsoNormal">Six frontier models were evaluated: <strong>Claude Sonnet 4.6, DeepSeek V3.2, GPT-5.4, Gemini 3 Flash Preview, Qwen 3 Coder 480B, and Claude Opus 4.6</strong>. Key findings: (1) all models are overconfident by a factor of 1.27x to 1.82x relative to their actual accuracy; (2) sycophancy rates are zero across all models, but false defiance rates reach 40% for three models; (3) DeepSeek V3.2 achieves the highest combined benchmark score (76.89) and the best resistance score (80.00); (4) Gemini 3 Flash Preview exhibits the most dangerous calibration profile, averaging 96.56% confidence with only 56.92% accuracy, a gap of 39.64 percentage points. These findings demonstrate that raw diagnostic accuracy is insufficient for safe medical AI deployment: confidence calibration and resistance to incorrect authority must be evaluated jointly.</p>
<p class="MsoNormal">A supplementary finding documents that MedGemma 1.5 4B without fine-tuning is unsuitable for direct use in emergency triage: it produces unstructured conversational outputs that cannot support time-critical clinical decision-making, which exposes a critical gap between general medical pretraining and specialist deployment readiness.</p>
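
The calibration quantities named in the abstract can be illustrated with a minimal Python sketch. The Brier score and the equal-width-bin ECE below follow their standard textbook definitions. Everything else is an assumption made for illustration only: the record does not say how the benchmark turns the two metrics into its 70/30 composite or how the overconfidence factor is computed, so `composite_score` and `overconfidence` are hypothetical helpers, not the authors' implementation.

```python
from typing import List, Sequence, Tuple


def brier_score(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean squared gap between stated confidence (0..1) and the 0/1 outcome."""
    total = sum((c - float(y)) ** 2 for c, y in zip(confidences, correct))
    return total / len(confidences)


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Equal-width-bin ECE: size-weighted mean of |accuracy - mean confidence|."""
    n = len(confidences)
    bins: List[List[Tuple[float, float]]] = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 falls in the last bin
        bins[idx].append((c, float(y)))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


def composite_score(brier: float, ece: float,
                    w_brier: float = 0.7, w_ece: float = 0.3) -> float:
    # ASSUMPTION: both metrics are "lower is better" on a 0..1 scale, so each is
    # flipped to (1 - metric) before the 70/30 weighting and scaled to 0..100.
    # The record does not state the benchmark's actual formula.
    return 100.0 * (w_brier * (1.0 - brier) + w_ece * (1.0 - ece))


def overconfidence(mean_confidence_pct: float, accuracy_pct: float) -> Tuple[float, float]:
    # ASSUMPTION: the gap is a difference in percentage points and the
    # overconfidence factor is the ratio of mean confidence to accuracy.
    gap = mean_confidence_pct - accuracy_pct
    ratio = mean_confidence_pct / accuracy_pct
    return gap, ratio


if __name__ == "__main__":
    # Toy data, not benchmark data: four cases with stated confidence and correctness.
    conf = [0.9, 0.9, 0.6, 0.6]
    hit = [True, False, True, False]
    b = brier_score(conf, hit)
    e = expected_calibration_error(conf, hit)
    print(f"Brier={b:.3f} ECE={e:.3f} composite={composite_score(b, e):.2f}")

    # Figures reported in the abstract for Gemini 3 Flash Preview.
    gap, ratio = overconfidence(96.56, 56.92)
    print(f"gap={gap:.2f} pp, ratio={ratio:.2f}x")
```

Applied to the reported Gemini 3 Flash Preview figures, the difference reproduces the stated 39.64 percentage point gap, and the ratio of confidence to accuracy falls inside the 1.27x to 1.82x range given for all models, which is consistent with (but does not confirm) the ratio reading assumed here.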