Staff View: Examining Imbalance Effects on Performance and Demographic Fairness of Clinical Language Models :: Library Catalog

Bibliographic Details
Main Authors:	Jones, Precious; Liu, Weisi; Huang, I-Chan; Huang, Xiaolei
Format:	Preprint
Published:	2024
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2412.17803

_version_	1866917922289483776
author	Jones, Precious; Liu, Weisi; Huang, I-Chan; Huang, Xiaolei
author_facet	Jones, Precious; Liu, Weisi; Huang, I-Chan; Huang, Xiaolei
contents	Data imbalance is a fundamental challenge in applying language models to biomedical applications, particularly in ICD code prediction tasks where label and demographic distributions are uneven. While state-of-the-art language models have been increasingly adopted in biomedical tasks, few studies have systematically examined how data imbalance affects model performance and fairness across demographic groups. This study fills the gap by statistically probing the relationship between data imbalance and model performance in ICD code prediction. We analyze imbalances in a standard benchmark dataset across gender, age, ethnicity, and social determinants of health using state-of-the-art biomedical language models. By deploying diverse performance metrics and statistical analyses, we explore the influence of data imbalance on performance variations and demographic fairness. Our study shows that data imbalance significantly impacts model performance and fairness, but feature similarity to the majority class may be a more critical factor. We believe this study provides valuable insights for developing more equitable and robust language models in healthcare applications.
format	Preprint
id	arxiv_https___arxiv_org_abs_2412_17803
institution	arXiv
publishDate	2024
record_format	arxiv
title	Examining Imbalance Effects on Performance and Demographic Fairness of Clinical Language Models
topic	Machine Learning
url	https://arxiv.org/abs/2412.17803
