Bibliographic Details
Main Authors: Lamparth, Max, Grabb, Declan, Franks, Amy, Gershan, Scott, Kunstman, Kaitlyn N., Lulla, Aaron, Roots, Monika Drummond, Sharma, Manu, Shrivastava, Aryan, Vasan, Nina, Waickman, Colleen
Format: Preprint
Published: 2025
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2502.16051
_version_ 1866917277210771456
author Lamparth, Max
Grabb, Declan
Franks, Amy
Gershan, Scott
Kunstman, Kaitlyn N.
Lulla, Aaron
Roots, Monika Drummond
Sharma, Manu
Shrivastava, Aryan
Vasan, Nina
Waickman, Colleen
contents Current medical language model (LM) benchmarks often over-simplify the complexities of day-to-day clinical practice tasks and instead rely on evaluating LMs on multiple-choice board exam questions. In psychiatry especially, these challenges are worsened by fairness and bias issues, since models can be swayed by patient demographics even when those factors should not influence clinical decisions. Thus, we present an expert-created and annotated dataset spanning five critical domains of decision-making in mental healthcare: treatment, diagnosis, documentation, monitoring, and triage. This U.S.-centric dataset, created without any LM assistance, is designed to capture the nuanced clinical reasoning and daily ambiguities mental health practitioners encounter, reflecting the inherent complexities of care delivery that are missing from existing datasets. Almost all base questions, each with five answer options, have had the decision-irrelevant demographic patient information removed and replaced with variables, e.g., for age or ethnicity, and are available for male, female, or non-binary-coded patients. This design enables systematic evaluations of model performance and bias by studying how demographic factors affect decision-making. For question categories dealing with ambiguity and multiple valid answer options, we create a preference dataset with uncertainties from the expert annotations. We outline a series of intended use cases and demonstrate the usability of our dataset by evaluating sixteen off-the-shelf and six (mental) health fine-tuned LMs on category-specific task accuracy, on the fairness impact of patient demographic information on decision-making, and on how consistently free-form responses deviate from human-annotated samples.
format Preprint
id arxiv_https___arxiv_org_abs_2502_16051
institution arXiv
publishDate 2025
record_format arxiv
title Moving Beyond Medical Exams: A Clinician-Annotated Fairness Dataset of Real-World Tasks and Ambiguity in Mental Healthcare
topic Computation and Language
url https://arxiv.org/abs/2502.16051