| Main author: | |
|---|---|
| Format: | Digital resource |
| Language: | Kyrgyz |
| Published: | Zenodo, 2026 |
| Subjects: | |
| Online link: | https://doi.org/10.5281/zenodo.19327769 |
Contents:
The Central Asian Turkic Discourse Corpus (CATDiC) is a pilot corpus of discourse data from five Turkic languages:

- Uzbek
- Kazakh
- Turkmen
- Kyrgyz
- Uyghur

Each dataset consists of approximately 2 minutes and 15 seconds of annotated dialogue.

The corpus includes:

- Video recordings (.mp4)
- Audio recordings (.wav)
- ELAN annotation files (.eaf)
- Structured metadata (metadata.csv, speakers.csv)

However, Turkmen data is excluded due to consent restrictions.

## Annotation

The corpus is annotated using ELAN with multiple tiers.

Each tier name consists of an annotation label and a speaker identifier, separated by the "@" symbol.
The part before "@" indicates the type of annotation, and the part after "@" indicates the speaker ID.

Speaker identifiers vary across recordings and are anonymized.
Detailed speaker information is provided in the metadata.

The annotation layers include:

- transcription
- Japanese translation
- segmentation
- morpheme-level annotation
- glossing
- discourse annotation (DA_dim, DA_func)
- annotation indices

## Notes

This is a pilot dataset.
Some annotation layers may be incomplete or exploratory.

Sensitive information has been anonymized.
Proper names in the recordings have been masked in both audio and transcription where necessary.

## Usage

CATDiC is intended for:

- Discourse analysis
- Turkic linguistics
- Comparative studies
- Annotation research

## License

This dataset is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

## Data Format

Data files are provided as compressed archives (.zip).
Please extract them before use.

## Note

Metadata files (metadata.csv, speakers.csv) have been added in this version.
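
The tier naming convention described under Annotation (an annotation label, then "@", then a speaker ID) can be unpacked with a short script. The following is a minimal sketch, not part of the dataset: it assumes the standard ELAN .eaf layout, in which each tier is a `TIER` element carrying a `TIER_ID` attribute. The function names `split_tier_name` and `tiers_by_speaker` are hypothetical helpers, and any speaker IDs or tier labels you pass in depend on the actual files.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def split_tier_name(tier_id):
    """Split 'label@speaker' into (label, speaker).

    Returns (tier_id, None) when the name contains no '@',
    so tiers that do not follow the convention are still kept.
    """
    label, sep, speaker = tier_id.partition("@")
    return (label, speaker) if sep else (tier_id, None)


def tiers_by_speaker(eaf_source):
    """Group annotation labels by speaker ID for one .eaf file.

    eaf_source may be a file path or an open file object.
    Assumes tiers are <TIER TIER_ID="..."> elements (standard EAF layout).
    """
    root = ET.parse(eaf_source).getroot()
    grouped = defaultdict(list)
    for tier in root.iter("TIER"):
        label, speaker = split_tier_name(tier.get("TIER_ID", ""))
        grouped[speaker].append(label)
    return dict(grouped)
```

Calling `tiers_by_speaker` on an extracted .eaf file gives a dictionary keyed by speaker ID, which can then be matched against speakers.csv. Because this is a pilot dataset and some layers may be incomplete, the set of labels per speaker may differ between recordings.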