Staff View :: Library Catalog

Bibliographic Details
Main Authors:	Choi, Janghyeok; Lee, Jaewon; Cho, Sungzoon
Format:	Preprint
Published:	2026
Subjects:	Computation and Language; Artificial Intelligence; Information Retrieval
Online Access:	https://arxiv.org/abs/2603.22765

_version_	1866910068245528576
author	Choi, Janghyeok; Lee, Jaewon; Cho, Sungzoon
author_facet	Choi, Janghyeok; Lee, Jaewon; Cho, Sungzoon
contents	Data scarcity remains a persistent challenge in low-resource domains. While existing data augmentation methods leverage the generative capabilities of large language models (LLMs) to produce large volumes of synthetic data, these approaches often prioritize quantity over quality and lack domain-specific strategies. In this work, we introduce DALDALL, a persona-based data augmentation framework tailored for legal information retrieval (IR). Our method employs domain-specific professional personas (such as attorneys, prosecutors, and judges) to generate synthetic queries that exhibit substantially greater lexical and semantic diversity than vanilla prompting approaches. Experiments on the CLERC and COLIEE benchmarks demonstrate that persona-based augmentation improves lexical diversity, as measured by Self-BLEU scores, while preserving semantic fidelity to the original queries. Furthermore, dense retrievers fine-tuned on persona-augmented data consistently match or exceed the recall of those trained on original data or generic augmentations. These findings establish persona-based prompting as an effective strategy for generating high-quality training data in specialized, low-resource domains.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_22765
institution	arXiv
publishDate	2026
record_format	arxiv
title	DALDALL: Data Augmentation for Lexical and Semantic Diverse in Legal Domain by leveraging LLM-Persona
topic	Computation and Language; Artificial Intelligence; Information Retrieval
url	https://arxiv.org/abs/2603.22765

Similar Items