MARC21: :: Library Catalog

Salvato in:

Dettagli Bibliografici
Autori principali:	Bhattarai, Kriti, Keloth, Vipina K., Wright, Donald, Loza, Andrew, Ren, Yang, Xu, Hua
Natura:	Preprint
Pubblicazione:	2026
Soggetti:	Computation and Language Information Retrieval
Accesso online:	https://arxiv.org/abs/2601.12632
Tags:	Aggiungi Tag Nessun Tag, puoi essere il primo ad aggiungerne!!

_version_	1866918295089709056
author	Bhattarai, Kriti Keloth, Vipina K. Wright, Donald Loza, Andrew Ren, Yang Xu, Hua
author_facet	Bhattarai, Kriti Keloth, Vipina K. Wright, Donald Loza, Andrew Ren, Yang Xu, Hua
contents	Objective: Large language models (LLMs) are increasingly applied in biomedical settings, and existing benchmark datasets have played an important role in supporting model development and evaluation. However, these benchmarks often have limitations. Many rely on static or outdated datasets that fail to capture the dynamic, context-rich, and high-stakes nature of biomedical knowledge. They also carry increasing risk of data leakage due to overlap with model pretraining corpora and often overlook critical dimensions such as robustness to linguistic variation and potential demographic biases. Materials and Methods: To address these gaps, we introduce BioPulse-QA, a benchmark that evaluates LLMs on answering questions from newly published biomedical documents including drug labels, trial protocols, and clinical guidelines. BioPulse-QA includes 2,280 expert-verified question answering (QA) pairs and perturbed variants, covering both extractive and abstractive formats. We evaluate four LLMs - GPT-4o, GPT-o1, Gemini-2.0-Flash, and LLaMA-3.1 8B Instruct - released prior to the publication dates of the benchmark documents. Results: GPT-o1 achieves the highest relaxed F1 score (0.92), followed by Gemini-2.0-Flash (0.90) on drug labels. Clinical trials are the most challenging source, with extractive F1 scores as low as 0.36. Discussion and Conclusion: Performance differences are larger for paraphrasing than for typographical errors, while bias testing shows negligible differences. BioPulse-QA provides a scalable and clinically relevant framework for evaluating biomedical LLMs.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_12632
institution	arXiv
publishDate	2026
record_format	arxiv
spellingShingle	BioPulse-QA: A Dynamic Biomedical Question-Answering Benchmark for Evaluating Factuality, Robustness, and Bias in Large Language Models Bhattarai, Kriti Keloth, Vipina K. Wright, Donald Loza, Andrew Ren, Yang Xu, Hua Computation and Language Information Retrieval Objective: Large language models (LLMs) are increasingly applied in biomedical settings, and existing benchmark datasets have played an important role in supporting model development and evaluation. However, these benchmarks often have limitations. Many rely on static or outdated datasets that fail to capture the dynamic, context-rich, and high-stakes nature of biomedical knowledge. They also carry increasing risk of data leakage due to overlap with model pretraining corpora and often overlook critical dimensions such as robustness to linguistic variation and potential demographic biases. Materials and Methods: To address these gaps, we introduce BioPulse-QA, a benchmark that evaluates LLMs on answering questions from newly published biomedical documents including drug labels, trial protocols, and clinical guidelines. BioPulse-QA includes 2,280 expert-verified question answering (QA) pairs and perturbed variants, covering both extractive and abstractive formats. We evaluate four LLMs - GPT-4o, GPT-o1, Gemini-2.0-Flash, and LLaMA-3.1 8B Instruct - released prior to the publication dates of the benchmark documents. Results: GPT-o1 achieves the highest relaxed F1 score (0.92), followed by Gemini-2.0-Flash (0.90) on drug labels. Clinical trials are the most challenging source, with extractive F1 scores as low as 0.36. Discussion and Conclusion: Performance differences are larger for paraphrasing than for typographical errors, while bias testing shows negligible differences. BioPulse-QA provides a scalable and clinically relevant framework for evaluating biomedical LLMs.
title	BioPulse-QA: A Dynamic Biomedical Question-Answering Benchmark for Evaluating Factuality, Robustness, and Bias in Large Language Models
topic	Computation and Language Information Retrieval
url	https://arxiv.org/abs/2601.12632

Documenti analoghi