Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Yan, Yiwei, Li, Hao, He, Hua, Kai, Gong, Yang, Zhengyi, Liu, Guanfeng
Format:	Preprint
Published:	2025
Subjects:	Computation and Language Artificial Intelligence
Online Access:	https://arxiv.org/abs/2601.09717
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866908766400675840
author	Yan, Yiwei Li, Hao He, Hua Kai, Gong Yang, Zhengyi Liu, Guanfeng
author_facet	Yan, Yiwei Li, Hao He, Hua Kai, Gong Yang, Zhengyi Liu, Guanfeng
contents	Online medical consultations generate large volumes of conversational health data that often embed protected health information, requiring robust methods to classify data categories and assign risk levels in line with policies and practice. However, existing approaches lack unified standards and reliable automated methods to fulfill sensitivity classification for such conversational health data. This study presents a large language model-based extraction pipeline, SALP-CG, for classifying and grading privacy risks in online conversational health data. We concluded health-data classification and grading rules in accordance with GB/T 39725-2020. Combining few-shot guidance, JSON Schema constrained decoding, and deterministic high-risk rules, the backend-agnostic extraction pipeline achieves strong category compliance and reliable sensitivity across diverse LLMs. On the MedDialog-CN benchmark, models yields robust entity counts, high schema compliance, and accurate sensitivity grading, while the strongest model attains micro-F1=0.900 for maximum-level prediction. The category landscape stratified by sensitivity shows that Level 2-3 items dominate, enabling re-identification when combined; Level 4-5 items are less frequent but carry outsize harm. SALP-CG reliably helps classify categories and grading sensitivity in online conversational health data across LLMs, offering a practical method for health data governance. Code is available at https://github.com/dommii1218/SALP-CG.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_09717
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	SALP-CG: Standard-Aligned LLM Pipeline for Classifying and Grading Large Volumes of Online Conversational Health Data Yan, Yiwei Li, Hao He, Hua Kai, Gong Yang, Zhengyi Liu, Guanfeng Computation and Language Artificial Intelligence Online medical consultations generate large volumes of conversational health data that often embed protected health information, requiring robust methods to classify data categories and assign risk levels in line with policies and practice. However, existing approaches lack unified standards and reliable automated methods to fulfill sensitivity classification for such conversational health data. This study presents a large language model-based extraction pipeline, SALP-CG, for classifying and grading privacy risks in online conversational health data. We concluded health-data classification and grading rules in accordance with GB/T 39725-2020. Combining few-shot guidance, JSON Schema constrained decoding, and deterministic high-risk rules, the backend-agnostic extraction pipeline achieves strong category compliance and reliable sensitivity across diverse LLMs. On the MedDialog-CN benchmark, models yields robust entity counts, high schema compliance, and accurate sensitivity grading, while the strongest model attains micro-F1=0.900 for maximum-level prediction. The category landscape stratified by sensitivity shows that Level 2-3 items dominate, enabling re-identification when combined; Level 4-5 items are less frequent but carry outsize harm. SALP-CG reliably helps classify categories and grading sensitivity in online conversational health data across LLMs, offering a practical method for health data governance. Code is available at https://github.com/dommii1218/SALP-CG.
title	SALP-CG: Standard-Aligned LLM Pipeline for Classifying and Grading Large Volumes of Online Conversational Health Data
topic	Computation and Language Artificial Intelligence
url	https://arxiv.org/abs/2601.09717

Similar Items