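Staff View: RWESummary: A Framework and Test for Choosing Large Language Models to Summarize Real-World Evidence (RWE) Studies :: Library Catalog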

Bibliographic Details
Main Authors:	Mukerji, Arjun; Jackson, Michael L.; Jones, Jason; Sanghavi, Neil
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2506.18819

_version_	1866909657567592448
author	Mukerji, Arjun; Jackson, Michael L.; Jones, Jason; Sanghavi, Neil
author_facet	Mukerji, Arjun; Jackson, Michael L.; Jones, Jason; Sanghavi, Neil
contents	Large Language Models (LLMs) have been extensively evaluated for general summarization tasks as well as medical research assistance, but they have not been specifically evaluated for the task of summarizing real-world evidence (RWE) from structured output of RWE studies. We introduce RWESummary, a proposed addition to the MedHELM framework (Bedi, Cui, Fuentes, Unell et al., 2025) to enable benchmarking of LLMs for this task. RWESummary includes one scenario and three evaluations covering major types of errors observed in summarization of medical research studies and was developed using Atropos Health proprietary data. Additionally, we use RWESummary to compare the performance of different LLMs in our internal RWE summarization tool. At the time of publication, with 13 distinct RWE studies, we found the Gemini 2.5 models performed best overall (both Flash and Pro). We suggest RWESummary as a novel and useful foundation model benchmark for real-world evidence study summarization.
format	Preprint
id	arxiv_https___arxiv_org_abs_2506_18819
institution	arXiv
publishDate	2025
record_format	arxiv
title	RWESummary: A Framework and Test for Choosing Large Language Models to Summarize Real-World Evidence (RWE) Studies
topic	Computation and Language; Artificial Intelligence
url	https://arxiv.org/abs/2506.18819

Similar Items