Staff View: 20min-XD: A Comparable Corpus of Swiss News Articles :: Library Catalog

Bibliographic Details
Main Authors:	Wastl, Michelle; Vamvas, Jannis; Calleri, Selena; Sennrich, Rico
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2504.21677

_version_	1866916715133140992
author	Wastl, Michelle; Vamvas, Jannis; Calleri, Selena; Sennrich, Rico
author_facet	Wastl, Michelle; Vamvas, Jannis; Calleri, Selena; Sennrich, Rico
contents	We present 20min-XD (20 Minuten cross-lingual document-level), a French-German, document-level comparable corpus of news articles, sourced from the Swiss online news outlet 20 Minuten/20 minutes. Our dataset comprises around 15,000 article pairs spanning 2015 to 2024, automatically aligned based on semantic similarity. We detail the data collection process and alignment methodology. Furthermore, we provide a qualitative and quantitative analysis of the corpus. The resulting dataset exhibits a broad spectrum of cross-lingual similarity, ranging from near-translations to loosely related articles, making it valuable for various NLP applications and broad linguistically motivated studies. We publicly release the dataset in document- and sentence-aligned versions and code for the described experiments.
format	Preprint
id	arxiv_https___arxiv_org_abs_2504_21677
institution	arXiv
publishDate	2025
record_format	arxiv
title	20min-XD: A Comparable Corpus of Swiss News Articles
topic	Computation and Language
url	https://arxiv.org/abs/2504.21677
