Bibliographic Details
Main Authors: Bednarczyk, Lydie, Zaghir, Jamil, Ehrsam, Julien, Tcherepanova, Maria, Skalafouris, Christian, Gariani, Karim, Geslin, Catherine, Rivara, Claire-Bénédicte, Bonnabry, Pascal, Gosetto, Laetitia, Dubos, Richard, Bjelogrlic, Mina, Gaudet-Blavignac, Christophe, Lovis, Christian
Format: Preprint
Published: 2026
Subjects: Computers and Society, Artificial Intelligence, Computation and Language, Methodology
Online Access: https://arxiv.org/abs/2605.04085
_version_ 1866911649776009216
author Bednarczyk, Lydie
Zaghir, Jamil
Ehrsam, Julien
Tcherepanova, Maria
Skalafouris, Christian
Gariani, Karim
Geslin, Catherine
Rivara, Claire-Bénédicte
Bonnabry, Pascal
Gosetto, Laetitia
Dubos, Richard
Bjelogrlic, Mina
Gaudet-Blavignac, Christophe
Lovis, Christian
contents Objectives: Large language models (LLMs) are increasingly used for clinical text summarization, yet structured methods to assess associated patient safety risks remain limited. Failure Mode, Effects, and Criticality Analysis (FMECA) provides a proactive framework for systematic risk identification but has not been adapted to LLM-generated clinical content. This study aimed to develop and validate a novel FMECA framework for the prospective assessment of patient safety risks in LLM-generated clinical summaries.
Materials and Methods: An interdisciplinary expert panel (n = 8) developed a taxonomy of failure modes through literature review and brainstorming. Standard FMECA dimensions (occurrence, severity, detectability) were adapted into 5-point ordinal scales. The framework was applied to 36 discharge summaries from four patients, generated by an open LLM (GPT-OSS 120B) using real-world clinical data from the Geneva University Hospitals. Reviewers independently annotated the summaries across two rounds. Inter-rater reliability was assessed at failure mode, severity and detectability score levels. Usability and content validity were evaluated using an adapted System Usability Scale and structured feedback.
Results: The final framework comprised 14 failure modes organized into categories. Inter-rater agreement improved between rounds, reaching moderate-to-substantial agreement for failure mode identification and good agreement for severity and detectability scoring. Usability was rated as good (mean SUS: 79.2/100), with high evaluator confidence.
Discussion and Conclusion: This study presents the first FMECA-based framework for systematic patient safety risk assessment of LLM-generated clinical summaries. The framework provides a structured and reproducible method for identifying clinically relevant risks caused by these summaries.
format Preprint
id arxiv_https___arxiv_org_abs_2605_04085
institution arXiv
publishDate 2026
record_format arxiv
title Evaluating Patient Safety Risks in Generative AI: Development and Validation of a FMECA Framework for Generated Clinical Content
topic Computers and Society
Artificial Intelligence
Computation and Language
Methodology
url https://arxiv.org/abs/2605.04085