Bibliographic Details
Main Authors: Nguyen, Dang; Li, Zeman; Bateni, Mohammadhossein; Mirrokni, Vahab; Razaviyayn, Meisam; Mirzasoleiman, Baharan
Format: Preprint
Published: 2025
Subjects: Machine Learning; Computation and Language
Online Access: https://arxiv.org/abs/2502.17607
author Nguyen, Dang
Li, Zeman
Bateni, Mohammadhossein
Mirrokni, Vahab
Razaviyayn, Meisam
Mirzasoleiman, Baharan
contents Synthetic data has the potential to improve model performance and training efficiency while protecting the privacy of real training examples. Nevertheless, existing approaches to synthetic text generation are mostly heuristic: they cannot generate human-readable text without compromising the privacy of real data, nor do they provide performance guarantees for training Large Language Models (LLMs). In this work, we propose the first theoretically rigorous approach for generating synthetic human-readable text that provides convergence, performance, and privacy guarantees for fine-tuning LLMs on a target task. To do so, we leverage the Alternating Direction Method of Multipliers (ADMM), which iteratively optimizes the embeddings of synthetic examples to match the noisy gradient of the target training or validation data and maps them to a sequence of text tokens with low perplexity. As a result, fine-tuning on the generated synthetic text is guaranteed to converge to a close neighborhood of the solution obtained by fine-tuning on real data, while the privacy of the real data is preserved. Experiments on various classification tasks confirm the effectiveness of our proposed approach. Our code is available at https://github.com/BigML-CS-UCLA/GRADMM.
format Preprint
id arxiv_https___arxiv_org_abs_2502_17607
institution arXiv
publishDate 2025
record_format arxiv
title Synthetic Text Generation for Training Large Language Models via Gradient Matching
topic Machine Learning
Computation and Language
url https://arxiv.org/abs/2502.17607