Staff View: Synthetic Text Generation for Training Large Language Models via Gradient Matching :: Library Catalog

Bibliographic Details
Main Authors:	Nguyen, Dang; Li, Zeman; Bateni, Mohammadhossein; Mirrokni, Vahab; Razaviyayn, Meisam; Mirzasoleiman, Baharan
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Computation and Language
Online Access:	https://arxiv.org/abs/2502.17607

_version_	1866908397207552000
author	Nguyen, Dang; Li, Zeman; Bateni, Mohammadhossein; Mirrokni, Vahab; Razaviyayn, Meisam; Mirzasoleiman, Baharan
author_facet	Nguyen, Dang; Li, Zeman; Bateni, Mohammadhossein; Mirrokni, Vahab; Razaviyayn, Meisam; Mirzasoleiman, Baharan
contents	Synthetic data has the potential to improve the performance, training efficiency, and privacy of real training examples. Nevertheless, existing approaches for synthetic text generation are mostly heuristic: they cannot generate human-readable text without compromising the privacy of real data, nor do they provide performance guarantees for training Large Language Models (LLMs). In this work, we propose the first theoretically rigorous approach for generating synthetic human-readable text that provides convergence, performance, and privacy guarantees for fine-tuning LLMs on a target task. To do so, we leverage the Alternating Direction Method of Multipliers (ADMM), which iteratively optimizes the embeddings of synthetic examples to match the noisy gradient of the target training or validation data and maps them to a sequence of text tokens with low perplexity. In doing so, the generated synthetic text guarantees convergence of the model to a close neighborhood of the solution obtained by fine-tuning on real data, while preserving the privacy of that data. Experiments on various classification tasks confirm the effectiveness of our proposed approach. Our code is available at https://github.com/BigML-CS-UCLA/GRADMM.
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_17607
institution	arXiv
publishDate	2025
record_format	arxiv
title	Synthetic Text Generation for Training Large Language Models via Gradient Matching
topic	Machine Learning; Computation and Language
url	https://arxiv.org/abs/2502.17607