Bibliographic Details
Main Authors: Wu, Peiran; Liu, Che; Chen, Canyu; Li, Jun; Bercea, Cosmin I.; Arcucci, Rossella
Format: Preprint
Published: 2024
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2410.01089
author Wu, Peiran
Liu, Che
Chen, Canyu
Li, Jun
Bercea, Cosmin I.
Arcucci, Rossella
contents Advances in Multimodal Large Language Models (MLLMs) have significantly improved performance on medical tasks such as Visual Question Answering (VQA) and Report Generation (RG). However, the fairness of these models across diverse demographic groups remains underexplored, despite its importance in healthcare. This gap is partly due to the lack of demographic diversity in existing medical multimodal datasets, which complicates the evaluation of fairness. In response, we propose FMBench, the first benchmark designed to evaluate the fairness of MLLM performance across diverse demographic attributes. FMBench has the following key features: (1) It covers four demographic attributes (race, ethnicity, language, and gender) across two tasks, VQA and RG, under zero-shot settings. (2) Its VQA task is free-form, which enhances real-world applicability and mitigates the biases associated with predefined choices. (3) It uses both lexical metrics and LLM-based metrics, aligned with clinical evaluations, to assess models not only for linguistic accuracy but also from a clinical perspective. Furthermore, we introduce a new metric, Fairness-Aware Performance (FAP), to evaluate how fairly MLLMs perform across various demographic attributes. We thoroughly evaluate the performance and fairness of eight state-of-the-art open-source MLLMs, including both general and medical MLLMs, ranging from 7B to 26B parameters, on the proposed benchmark. We aim for FMBench to assist the research community in refining model evaluation and driving future advancements in the field. All data and code will be released upon acceptance.
format Preprint
id arxiv_https___arxiv_org_abs_2410_01089
institution arXiv
publishDate 2024
record_format arxiv
title FMBench: Benchmarking Fairness in Multimodal Large Language Models on Medical Tasks
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2410.01089