Staff View: Private Text Generation by Seeding Large Language Model Prompts :: Library Catalog

Bibliographic Details
Main Authors:	Nagesh, Supriya; Chen, Justin Y.; Mishra, Nina; Wagner, Tal
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2502.13193

_version_	1866915160314085376
author	Nagesh, Supriya; Chen, Justin Y.; Mishra, Nina; Wagner, Tal
author_facet	Nagesh, Supriya; Chen, Justin Y.; Mishra, Nina; Wagner, Tal
contents	We explore how private synthetic text can be generated by suitably prompting a large language model (LLM). This addresses a challenge for organizations like hospitals, which hold sensitive text data like patient medical records, and wish to share it in order to train machine learning models for medical tasks, while preserving patient privacy. Methods that rely on training or finetuning a model may be out of reach, either due to API limits of third-party LLMs, or due to ethical and legal prohibitions on sharing the private data with the LLM itself. We propose Differentially Private Keyphrase Prompt Seeding (DP-KPS), a method that generates a private synthetic text corpus from a sensitive input corpus, by accessing an LLM only through privatized prompts. It is based on seeding the prompts with private samples from a distribution over phrase embeddings, thus capturing the input corpus while achieving requisite output diversity and maintaining differential privacy. We evaluate DP-KPS on downstream ML text classification tasks, and show that the corpora it generates preserve much of the predictive power of the original ones. Our findings offer hope that institutions can reap ML insights by privately sharing data with simple prompts and little compute.
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_13193
institution	arXiv
publishDate	2025
record_format	arxiv
title	Private Text Generation by Seeding Large Language Model Prompts
topic	Computation and Language
url	https://arxiv.org/abs/2502.13193

Similar Items