| Main Authors: | , , |
|---|---|
| Format: | Digital resource |
| Language: | Arabic |
| Published: | Zenodo, 2025 |
| Subjects: | |
| Online access: | https://doi.org/10.5281/zenodo.17855012 |
Table of Contents:
- <p>A synthetic dataset of 43,316 conversations with a mean conversation length of 14.038 turns (rounded to three decimal places), a median of 12 turns, a range of 5 to 111 turns, and a total of 608,052 utterances (each turn counts as one utterance).</p> <p>The dataset is partitioned into training and test sets using an 80/20 split (34,653 training conversations / 8,663 test conversations).</p> <p>The synthetic data generation process systematically iterated over 93 topics and 151 countries, yielding 14,043 unique topic-country combinations. The generation pipeline was configured to produce 5 conversations per combination. After rigorous processing and a train/test split based on techniques to mitigate leakage risks, the end result was 43,316 conversations.</p>
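
The figures reported in the description can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not part of the dataset's own tooling: all variable names are hypothetical, and the `candidate_conversations` value is only an upper bound that assumes every topic-country combination produced exactly 5 conversations, which the record does not confirm.

```python
# Figures copied from the dataset description; variable names are illustrative.
TOTAL_CONVERSATIONS = 43_316
TOTAL_UTTERANCES = 608_052
TRAIN_CONVERSATIONS = 34_653
TEST_CONVERSATIONS = 8_663
TOPICS = 93
COUNTRIES = 151
CONVERSATIONS_PER_COMBINATION = 5

# Every topic paired with every country.
combinations = TOPICS * COUNTRIES

# Assumption: upper bound on generated conversations before any filtering,
# if each combination yielded exactly the configured 5 conversations.
candidate_conversations = combinations * CONVERSATIONS_PER_COMBINATION

# Each turn is one utterance, so mean turns = utterances / conversations.
mean_turns = round(TOTAL_UTTERANCES / TOTAL_CONVERSATIONS, 3)

# Share of conversations assigned to the training set.
train_share = TRAIN_CONVERSATIONS / TOTAL_CONVERSATIONS

# Fraction of the assumed upper bound that survived processing.
retained_share = TOTAL_CONVERSATIONS / candidate_conversations

print(f"topic-country combinations: {combinations}")
print(f"candidate conversations (upper bound): {candidate_conversations}")
print(f"mean turns per conversation: {mean_turns}")
print(f"training share: {train_share:.4f}")
print(f"retained share of upper bound: {retained_share:.3f}")
```

Under that assumption, the gap between the upper bound and the final 43,316 conversations reflects whatever was removed during processing; the record does not say which steps accounted for the reduction.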