Bibliographic Details
Main Authors: Kambhatla, Gauri; Shaib, Chantal; Govindarajan, Venkata
Format: Preprint
Published: 2025
Online Access: https://arxiv.org/abs/2505.17390
Table of Contents:
  • Fine-grained personas have recently been used for generating 'diverse' synthetic data for pre-training and supervised fine-tuning of Large Language Models (LLMs). In this work, we measure the diversity of persona-driven synthetically generated prompts and responses with a suite of lexical diversity and redundancy metrics. First, we find that synthetic prompts/instructions are significantly less diverse than human-written ones. Next, we sample responses from LLMs of different sizes with fine-grained and coarse persona descriptions to investigate how much fine-grained detail in persona descriptions contributes to generated text diversity. Our results indicate that persona prompting produces higher lexical diversity than prompting without personas, particularly in larger models. In contrast, adding fine-grained persona details yields minimal gains in diversity compared to simply specifying a length cutoff in the prompt.
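The record does not say which lexical diversity and redundancy metrics the authors use. As an illustration only, the sketch below computes two measures commonly used for this kind of comparison: distinct-n (the share of unique n-grams in a corpus) and a compression ratio (raw size over zlib-compressed size, where a higher value means more redundancy). The function names `distinct_n` and `compression_ratio`, the whitespace tokenization, and the sample texts are assumptions made for this example, not the paper's implementation.

```python
import zlib


def distinct_n(texts, n=2):
    """Share of unique n-grams among all n-grams in the corpus (higher = more diverse)."""
    total = 0
    unique = set()
    for text in texts:
        # Assumption: simple lowercase whitespace tokenization.
        tokens = text.lower().split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0


def compression_ratio(texts):
    """Raw byte length divided by zlib-compressed length (higher = more redundant)."""
    raw = "\n".join(texts).encode("utf-8")
    if not raw:
        return 0.0
    return len(raw) / len(zlib.compress(raw))


# Hypothetical sample corpora, for illustration only.
repetitive = ["the cat sat on the mat"] * 20
varied = [
    "a heron waits beside the frozen river",
    "markets closed early after the storm warning",
    "she tuned the old violin before dawn",
    "nobody expected the bridge to reopen this year",
    "fresh bread cooled on a wire rack",
]

print("distinct-2:", distinct_n(repetitive), distinct_n(varied))
print("compression ratio:", compression_ratio(repetitive), compression_ratio(varied))
```

Applied to two sets of generated responses (for example, with and without persona descriptions), a higher distinct-n together with a lower compression ratio would point to greater lexical diversity under these particular measures.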