Bibliographic Details
Main Authors: Shukla, Aditya; Yuan, Yining; Tamo, Ben; Wang, Yifei; Nnamdi, Micky; Tan, Shaun; Li, Jieru; Marteau, Benoit; Willingham, Brad; Wang, May
Format: Preprint
Published: 2026
Subjects: Artificial Intelligence
Online Access: https://arxiv.org/abs/2603.01557
contents Large language models (LLMs) can generate fluent clinical summaries of remote therapeutic monitoring time series. However, it remains unclear whether these narratives faithfully capture clinically significant events, such as sustained abnormalities. Existing evaluation metrics primarily focus on semantic similarity and linguistic quality, leaving event-level correctness largely unmeasured. To address this gap, we introduce an event-based evaluation framework for multimodal time-series summarization using the Technology-Integrated Health Management (TIHM)-1.5 dementia monitoring dataset. Clinically grounded daily events are derived through rule-based abnormal thresholds and temporal persistence criteria. Model-generated summaries are then aligned with these structured facts. Our evaluation protocol measures abnormality recall, duration recall, measurement coverage, and hallucinated event mentions. We benchmark three approaches: zero-shot prompting, statistical prompting, and a vision-based pipeline that uses rendered time-series visualizations. The results reveal a striking decoupling between conventional metrics and clinical event fidelity. Models that achieve high semantic similarity scores often exhibit near-zero abnormality recall. In contrast, the vision-based approach demonstrates the strongest event alignment, achieving 45.7% abnormality recall and 100% duration recall. These findings underscore the importance of event-aware evaluation to ensure reliable clinical time-series summarization.
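The event-based evaluation described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the preprint: the measurement names, the threshold values, the `min_days` persistence rule, the `Event` and `Mention` types, and the exact metric definitions (in particular, duration recall is computed only over events the summary recalled, and extracting `Mention` records from a free-text summary is left out).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    measurement: str
    start: int   # index of the first abnormal day
    length: int  # number of consecutive abnormal days


@dataclass(frozen=True)
class Mention:
    # Hypothetical structured fact extracted from a generated summary.
    measurement: str
    abnormal: bool
    duration: Optional[int] = None  # days, if the summary states one


def derive_events(series, limits, min_days=2):
    """Return runs of at least `min_days` consecutive out-of-range values."""
    events = []
    for name, values in series.items():
        low, high = limits[name]
        run_start = None
        # A trailing None sentinel closes any run that reaches the last day.
        for day, value in enumerate(list(values) + [None]):
            abnormal = value is not None and not (low <= value <= high)
            if abnormal and run_start is None:
                run_start = day
            elif not abnormal and run_start is not None:
                if day - run_start >= min_days:
                    events.append(Event(name, run_start, day - run_start))
                run_start = None
    return events


def score(events, mentions, measurements):
    """Compare summary mentions with rule-derived events (assumed definitions)."""
    abnormal_mentions = {m.measurement: m for m in mentions if m.abnormal}
    event_names = {e.measurement for e in events}
    recalled = [e for e in events if e.measurement in abnormal_mentions]
    duration_ok = [
        e for e in recalled
        if abnormal_mentions[e.measurement].duration == e.length
    ]
    mentioned = {m.measurement for m in mentions}
    return {
        "abnormality_recall": len(recalled) / len(events) if events else 0.0,
        "duration_recall": len(duration_ok) / len(recalled) if recalled else 0.0,
        "measurement_coverage": len(mentioned & set(measurements)) / len(measurements),
        # Abnormal mentions for measurements with no rule-derived event.
        "hallucinated_events": sum(
            1 for name in abnormal_mentions if name not in event_names
        ),
    }


# Illustrative daily values and thresholds (not clinical guidance).
series = {
    "heart_rate": [72, 118, 121, 119, 75],
    "body_temperature": [36.6, 36.7, 38.2, 36.5, 36.6],
}
limits = {"heart_rate": (50, 100), "body_temperature": (36.0, 37.5)}
events = derive_events(series, limits, min_days=2)
mentions = [Mention("heart_rate", True, 3), Mention("body_temperature", True)]
print(score(events, mentions, list(series)))
```

In this toy input the single-day temperature spike fails the persistence rule, so a summary that calls it a sustained abnormality is counted as a hallucinated event even though the measurement itself is covered.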
format Preprint
id arxiv_https___arxiv_org_abs_2603_01557
institution arXiv
publishDate 2026
record_format arxiv
title Benchmarking LLM Summaries of Multimodal Clinical Time Series for Remote Monitoring
topic Artificial Intelligence
url https://arxiv.org/abs/2603.01557