| Main Authors: | Wastl, Michelle; Vamvas, Jannis; Calleri, Selena; Sennrich, Rico |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Computation and Language |
| Online Access: | https://arxiv.org/abs/2504.21677 |
| _version_ | 1866916715133140992 |
|---|---|
| author | Wastl, Michelle; Vamvas, Jannis; Calleri, Selena; Sennrich, Rico |
| author_facet | Wastl, Michelle; Vamvas, Jannis; Calleri, Selena; Sennrich, Rico |
| contents | We present 20min-XD (20 Minuten cross-lingual document-level), a French-German, document-level comparable corpus of news articles, sourced from the Swiss online news outlet 20 Minuten/20 minutes. Our dataset comprises around 15,000 article pairs spanning 2015 to 2024, automatically aligned based on semantic similarity. We detail the data collection process and alignment methodology. Furthermore, we provide a qualitative and quantitative analysis of the corpus. The resulting dataset exhibits a broad spectrum of cross-lingual similarity, ranging from near-translations to loosely related articles, making it valuable for various NLP applications and broad linguistically motivated studies. We publicly release the dataset in document- and sentence-aligned versions and code for the described experiments. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2504_21677 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | 20min-XD: A Comparable Corpus of Swiss News Articles |
| topic | Computation and Language |
| url | https://arxiv.org/abs/2504.21677 |