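Staff View: Fidelity, Diversity, and Privacy: A Multi-Dimensional LLM Evaluation for Clinical Data Augmentation :: Library Catalog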

Bibliographic Details
Main Authors:	Iglesias, Guillermo; Bello-Orgaz, Gema; Navas-Loro, María; Ramirez-Atencia, Cristian; Robert, Mercè Salvador; Baca-Garcia, Enrique
Format:	Preprint
Published:	2026
Subjects:	Machine Learning; Cryptography and Security
Online Access:	https://arxiv.org/abs/2604.27014

_version_	1866915968004915200
author	Iglesias, Guillermo; Bello-Orgaz, Gema; Navas-Loro, María; Ramirez-Atencia, Cristian; Robert, Mercè Salvador; Baca-Garcia, Enrique
author_facet	Iglesias, Guillermo; Bello-Orgaz, Gema; Navas-Loro, María; Ramirez-Atencia, Cristian; Robert, Mercè Salvador; Baca-Garcia, Enrique
contents	The scarcity of high-quality annotated medical data, particularly in mental health, poses a significant bottleneck for training robust machine learning models. Privacy regulations restrict data sharing, making synthetic data generation a promising alternative. The use of Large Language Models (LLMs) in a data augmentation pipeline could be leveraged as an alternative in this field. In the proposed methodology, DeepSeek-R1, OpenBioLLM-Llama3 and Qwen 3.5 are used to generate synthetic mental health evaluation reports conditioned on specific International Classification of Diseases, Tenth Revision (ICD-10) codes. Because naive text generation can lead to mode collapse or privacy breaches (memorization), a comprehensive evaluation framework is introduced. The generated diagnostic texts are assessed across three dimensions: semantic fidelity, lexical diversity, and privacy/plagiarism. The results demonstrate that all models can generate clinically coherent, diverse, and privacy-safe synthetic reports, significantly expanding the available training data for clinical natural language processing tasks without compromising patient confidentiality.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_27014
institution	arXiv
publishDate	2026
record_format	arxiv
title	Fidelity, Diversity, and Privacy: A Multi-Dimensional LLM Evaluation for Clinical Data Augmentation
topic	Machine Learning; Cryptography and Security
url	https://arxiv.org/abs/2604.27014

Similar Items