Staff View: Benchmarking LLM Summaries of Multimodal Clinical Time Series for Remote Monitoring :: Library Catalog

Bibliographic Details
Main Authors:	Shukla, Aditya; Yuan, Yining; Tamo, Ben; Wang, Yifei; Nnamdi, Micky; Tan, Shaun; Li, Jieru; Marteau, Benoit; Willingham, Brad; Wang, May
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2603.01557

_version_	1866912937098084352
author	Shukla, Aditya; Yuan, Yining; Tamo, Ben; Wang, Yifei; Nnamdi, Micky; Tan, Shaun; Li, Jieru; Marteau, Benoit; Willingham, Brad; Wang, May
author_facet	Shukla, Aditya; Yuan, Yining; Tamo, Ben; Wang, Yifei; Nnamdi, Micky; Tan, Shaun; Li, Jieru; Marteau, Benoit; Willingham, Brad; Wang, May
contents	Large language models (LLMs) can generate fluent clinical summaries of remote therapeutic monitoring time series. However, it remains unclear whether these narratives faithfully capture clinically significant events, such as sustained abnormalities. Existing evaluation metrics primarily focus on semantic similarity and linguistic quality, leaving event-level correctness largely unmeasured. To address this gap, we introduce an event-based evaluation framework for multimodal time-series summarization using the Technology-Integrated Health Management (TIHM)-1.5 dementia monitoring dataset. Clinically grounded daily events are derived through rule-based abnormal thresholds and temporal persistence criteria. Model-generated summaries are then aligned with these structured facts. Our evaluation protocol measures abnormality recall, duration recall, measurement coverage, and hallucinated event mentions. We benchmark three approaches: zero-shot prompting, statistical prompting, and a vision-based pipeline that uses rendered time-series visualizations. The results reveal a striking decoupling between conventional metrics and clinical event fidelity. Models that achieve high semantic similarity scores often exhibit near-zero abnormality recall. In contrast, the vision-based approach demonstrates the strongest event alignment, achieving 45.7% abnormality recall and 100% duration recall. These findings underscore the importance of event-aware evaluation to ensure reliable clinical time-series summarization.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_01557
institution	arXiv
publishDate	2026
record_format	arxiv
title	Benchmarking LLM Summaries of Multimodal Clinical Time Series for Remote Monitoring
topic	Artificial Intelligence
url	https://arxiv.org/abs/2603.01557
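
Illustrative note on the abstract in the contents field: the event-based evaluation it describes (rule-based abnormal thresholds, a temporal persistence criterion, then alignment of summary mentions against the derived events) could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation. The threshold values, the two-day persistence rule, the `Event` and `Mention` types, the function names, and the exact metric definitions are all hypothetical and do not come from the record; the sketch also assumes that mentions have already been extracted from a generated summary into structured form.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical normal ranges (low, high); NOT taken from the paper.
THRESHOLDS = {"heart_rate": (50.0, 100.0), "body_temperature": (35.0, 37.8)}
MIN_PERSISTENCE_DAYS = 2  # assumed persistence criterion


@dataclass(frozen=True)
class Event:
    measurement: str
    start_day: int
    duration_days: int


@dataclass(frozen=True)
class Mention:
    """One statement extracted from a summary (extraction step assumed)."""
    measurement: str
    abnormal: bool
    duration_days: Optional[int] = None


def derive_events(series):
    """Return runs of consecutive abnormal daily values that persist long enough."""
    events = []
    for name, values in series.items():
        low, high = THRESHOLDS[name]
        run_start = None
        # A trailing None acts as a sentinel that closes a run at the series end.
        for day, value in enumerate(list(values) + [None]):
            abnormal = value is not None and not (low <= value <= high)
            if abnormal and run_start is None:
                run_start = day
            elif not abnormal and run_start is not None:
                length = day - run_start
                if length >= MIN_PERSISTENCE_DAYS:
                    events.append(Event(name, run_start, length))
                run_start = None
    return events


def score(series, mentions):
    """Compute the four event-level measures under the assumed definitions."""
    events = derive_events(series)
    abnormal_mentions = [m for m in mentions if m.abnormal]
    flagged = {m.measurement for m in abnormal_mentions}
    stated = {(m.measurement, m.duration_days) for m in abnormal_mentions}
    event_names = {e.measurement for e in events}
    recalled = [e for e in events if e.measurement in flagged]
    duration_hits = [e for e in events if (e.measurement, e.duration_days) in stated]
    covered = {m.measurement for m in mentions} & set(series)
    n = len(events)
    return {
        "abnormality_recall": len(recalled) / n if n else None,
        "duration_recall": len(duration_hits) / n if n else None,
        "measurement_coverage": len(covered) / len(series) if series else None,
        # Abnormal claims about measurements that have no derived event.
        "hallucinated_mentions": sum(
            1 for m in abnormal_mentions if m.measurement not in event_names
        ),
    }


# Made-up daily values for illustration only.
example_series = {
    "heart_rate": [72, 105, 110, 108, 80],
    "body_temperature": [36.5, 36.7, 38.2, 36.6, 36.4],
}
example_mentions = [
    Mention("heart_rate", abnormal=True, duration_days=3),
    Mention("body_temperature", abnormal=True, duration_days=1),
]
result = score(example_series, example_mentions)
```

In this toy input the single-day temperature spike does not meet the assumed persistence rule, so a summary that reports it as an abnormal event is counted as a hallucinated mention even though it covers every measurement.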

Similar Items