Staff View: Named Entity Swapping for Metadata Anonymization in a Text Corpus :: Library Catalog

Bibliographic Details
Main Authors:	Greve, Jan; Sablica, Lukas
Format:	Preprint
Published:	2025
Subjects:	Applications
Online Access:	https://arxiv.org/abs/2505.21128

_version_	1866912397929742336
author	Greve, Jan; Sablica, Lukas
author_facet	Greve, Jan; Sablica, Lukas
contents	This work introduces an anonymization scheme for a corpus of texts to safeguard metadata from disclosure. It specifically aims to prevent large language models from identifying metadata associated with texts, thereby avoiding their influence on query responses. The core mechanism is called named entity swapping, a technique inspired by data swapping in statistical disclosure control. Our method randomly selects pairs of semantically similar substrings from different texts based on the similarity of their embedding vectors and interchanges some named entities between them. This prevents certain combinations of named entities from being uniquely associated with the metadata of individual texts. Our approach offers two key advantages. First, it enables users to determine the optimal level of anonymization that balances data utility and data risk through a calibration of several key decision variables. Second, it leverages text embeddings both to compute swapping weights and to assess data utility, enabling a high degree of flexibility and customization in the overall workflow. The effectiveness of the proposed method is demonstrated with an application that prevents the disclosure of company names in a cross-sectional dataset of earnings call transcripts.
format	Preprint
id	arxiv_https___arxiv_org_abs_2505_21128
institution	arXiv
publishDate	2025
record_format	arxiv
title	Named Entity Swapping for Metadata Anonymization in a Text Corpus
topic	Applications
url	https://arxiv.org/abs/2505.21128
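
Illustrative sketch (not part of the catalog record and not the authors' implementation): the abstract describes sampling pairs of semantically similar substrings from different texts, weighted by embedding similarity, and interchanging named entities between them. The Python below is a minimal sketch of that idea under loud assumptions: named entities are already detected and supplied per segment, a toy bag-of-words vector stands in for a real text embedding model, and the names `embed`, `cosine`, `swap_entities`, and `PLACEHOLDER` are hypothetical.

```python
import math
import random
from collections import Counter

PLACEHOLDER = "[ENT]"  # hypothetical mask token, so similarity reflects context only


def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def swap_entities(segments, n_swaps, rng):
    """Swap entities between similar segments that come from different texts.

    segments: list of dicts with keys 'doc', 'text', 'entity' (entity assumed
    to be pre-detected and to occur in the text). Returns modified copies.
    """
    out = [dict(s) for s in segments]
    # Embed each segment with its entity masked out.
    masked = [embed(s["text"].replace(s["entity"], PLACEHOLDER)) for s in out]
    # Candidate pairs must belong to different source texts.
    pairs = [(i, j) for i in range(len(out)) for j in range(i + 1, len(out))
             if out[i]["doc"] != out[j]["doc"]]
    weights = [cosine(masked[i], masked[j]) for i, j in pairs]
    for _ in range(n_swaps):
        if not pairs or sum(weights) <= 0:
            break  # nothing similar enough left to swap
        # Sample one pair with probability proportional to its similarity.
        k = rng.choices(range(len(pairs)), weights=weights, k=1)[0]
        i, j = pairs.pop(k)
        weights.pop(k)  # each pair is used at most once
        ent_i, ent_j = out[i]["entity"], out[j]["entity"]
        out[i]["text"] = out[i]["text"].replace(ent_i, ent_j)
        out[j]["text"] = out[j]["text"].replace(ent_j, ent_i)
        out[i]["entity"], out[j]["entity"] = ent_j, ent_i
    return out


if __name__ == "__main__":
    demo = [
        {"doc": "A", "text": "Acme reported strong quarterly revenue growth", "entity": "Acme"},
        {"doc": "B", "text": "Globex reported strong quarterly revenue growth", "entity": "Globex"},
    ]
    for seg in swap_entities(demo, n_swaps=1, rng=random.Random(0)):
        print(seg["doc"], "|", seg["text"])
```

In this sketch the number of swaps plays the role of one calibration knob. The abstract mentions several decision variables and an embedding-based utility assessment; neither is specified in this record, so they are not modeled here.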

Similar Items