Summary: :: Library Catalog

Bibliographic Details
Main Author:	Khan, Nabeera
Format:	Digital resource
Language:	English
Published:	Zenodo 2026
Subjects:	Model benchmark Claude medical AI metacognition confidence calibration sycophancy Deepseek Qwen LLM evaluation Gemini maternal health AI Benchmark
Online Access:	https://doi.org/10.5281/zenodo.20067056

Summary:

MEMB (Maternal Emergency Metacognition Benchmark) is a two-task evaluation framework for measuring metacognition, defined here as a model's ability to calibrate its own confidence, in large language models applied to maternal emergency diagnostics. Task 1 evaluates diagnostic accuracy jointly with confidence calibration across 130 clinical cases spanning rare and common obstetric emergencies, using a composite score weighted 70% toward calibration as measured by the Brier Score and 30% toward Expected Calibration Error (ECE). Task 2 evaluates sycophancy resistance, that is, whether models maintain clinical accuracy when confronted with confident but incorrect human assertions, across 30 adversarial cases using eight purpose-designed metrics, including Trust Alignment Score (TAS), Sycophancy Rate (SR), False Defiance Rate (FDR), and Confidence Gap (CG).

Six frontier models were evaluated: Claude Sonnet 4.6, DeepSeek V3.2, GPT-5.4, Gemini 3 Flash Preview, Qwen 3 Coder 480B, and Claude Opus 4.6. Key findings: (1) all models are overconfident, with stated confidence 1.27x to 1.82x their actual accuracy; (2) sycophancy rates are zero across all models, but false defiance rates reach 40% for three models; (3) DeepSeek V3.2 achieves the highest combined benchmark score (76.89) and the best resistance score (80.00); (4) Gemini 3 Flash Preview exhibits the most dangerous calibration profile, averaging 96.56% confidence with only 56.92% accuracy, a gap of 39.64 percentage points. These findings demonstrate that raw diagnostic accuracy alone is insufficient for safe medical AI deployment: confidence calibration and resistance to incorrect authority must be evaluated jointly.

A supplementary finding documents that MedGemma 1.5 4B, used without fine-tuning, is unsuitable for emergency triage: it produces unstructured conversational outputs that cannot support time-critical clinical decision-making. This points to a critical gap between general medical pretraining and readiness for specialist deployment.
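The calibration quantities named in the summary can be illustrated with a minimal Python sketch. It uses the standard textbook definitions of the Brier Score and equal-width-bin ECE; the record does not say how MEMB bins confidences or how it turns the 70/30 weighting into a score, so `n_bins=10` and the `composite_score` mapping below are assumptions made purely for illustration, not the benchmark's actual formula. All function names are hypothetical.

```python
from typing import Sequence


def brier_score(confidences: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared gap between stated confidence (0..1) and outcome (1 = correct)."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(confidences)


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> float:
    """Standard ECE over equal-width confidence bins (bin count is an assumption)."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of the cases.
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


def overconfidence_ratio(mean_confidence: float, accuracy: float) -> float:
    """Ratio of average stated confidence to observed accuracy."""
    return mean_confidence / accuracy


def composite_score(brier: float, ece: float) -> float:
    """ASSUMED mapping: only the 70/30 weights come from the summary.

    Both inputs are errors (lower is better), so each is flipped to 1 - error
    before weighting; the result is scaled to 0..100.
    """
    return 100.0 * (0.7 * (1.0 - brier) + 0.3 * (1.0 - ece))


if __name__ == "__main__":
    # Figures reported in the summary for Gemini 3 Flash Preview.
    print(round(overconfidence_ratio(0.9656, 0.5692), 2))
    print(round(96.56 - 56.92, 2))
```

Applied to the figures reported for Gemini 3 Flash Preview (96.56% mean confidence, 56.92% accuracy), the ratio is about 1.70, which falls inside the 1.27x to 1.82x range stated in the summary.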
