Bibliographic Details
Main Author: Tacke, Felix
Format: Digital resource
Language:
Published: Zenodo 2026
Online Access: https://doi.org/10.5281/zenodo.18740955
_version_ 1866901174515400704
author Tacke, Felix
author_facet Tacke, Felix
contents <p>This record contains the complete CO.PRE.PAN (Corpus de Prensa Panhispánico) press corpus, organized into country-specific ZIP archives with linguistically annotated JSON files. Due to copyright restrictions, all texts and annotations are distributed under restricted access and cannot be shared openly. Users may request access directly through Zenodo.</p> <p>Contents of this record</p> <p>Each {COUNTRYCODE}.zip archive contains:</p> <ol> <li>Press texts in plain text format (txt-files/)</li> <li>Annotated JSON files (json-annotated/)</li> </ol> <p>All ZIP archives were generated using the internal script "zenodo_corpus_zip.py", which automatically tracks timestamps and file changes to ensure reproducible versioning.</p> <p>Corpus description</p> <p>CO.PRE.PAN is a cross-national corpus of written press Spanish from 18 Spanish-speaking countries, comprising over 14 million words. It is structurally aligned with the spoken broadcast corpus CO.RA.PAN (Corpus Radiofónico Panhispánico) and serves as a scripted register baseline for comparative analyses of national standard varieties of Spanish. All texts are drawn from comparable press genres and produced under broadly equivalent publication conditions across countries, ensuring cross-national comparability.</p> <p>Versioning</p> <p>Each version of this record represents a coherent snapshot of the full corpus at a specific point in time. 
Updates may include newly added texts, corrected or extended annotations, and improvements to preprocessing and linguistic annotation.</p> <p>Annotation details</p> <p>Each JSON file contains:</p> <ul> <li>tokenization, sentence segmentation</li> <li>POS tags, lemmas, and morphological features</li> <li>dependency relations</li> <li>automatic categorization of verbal tense and related features</li> </ul> <p>All annotations are generated using spaCy (model: es_dep_news_trf), followed by project-specific quality control steps, using the same annotation pipeline applied to CO.RA.PAN.</p> <p>Legal and access information</p> <p>The restricted status of this record is due to copyright limitations. Only short text extracts may be displayed publicly under scientific quotation rules and text-and-data-mining provisions of EU Directive 2019/790 and the German UrhG (§51, §60d, §44b). Redistribution or reuse of the full texts and annotations is not permitted.</p> <p>Access requests can be submitted directly through Zenodo. For scientific inquiries or technical questions, please contact the CO.PRE.PAN project team.</p>
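The archive layout described above (txt-files/ and json-annotated/ inside each {COUNTRYCODE}.zip) could be read along the following lines. This is only a minimal sketch: the record does not document the JSON schema, so the keys "sentences", "tokens", and "pos" are assumptions made for illustration, as are the helper names pos_counts and build_demo_archive and the demo file names. The corpus itself is under restricted access, so the sketch builds a tiny in-memory stand-in archive rather than reading real data.

```python
import io
import json
import zipfile
from collections import Counter


def pos_counts(archive: zipfile.ZipFile) -> Counter:
    """Count POS tags over every JSON file under json-annotated/ in one archive."""
    counts = Counter()
    for name in archive.namelist():
        # Match the folder anywhere in the path, in case files are nested
        # under a country-code directory (the exact nesting is not documented).
        if "json-annotated/" not in name or not name.endswith(".json"):
            continue
        doc = json.loads(archive.read(name).decode("utf-8"))
        # ASSUMED schema: {"sentences": [{"tokens": [{"pos": ...}, ...]}, ...]}
        for sentence in doc.get("sentences", []):
            for token in sentence.get("tokens", []):
                counts[token["pos"]] += 1
    return counts


def build_demo_archive() -> bytes:
    """Create a tiny in-memory stand-in for a {COUNTRYCODE}.zip archive."""
    doc = {"sentences": [{"tokens": [
        {"text": "El", "lemma": "el", "pos": "DET", "dep": "det"},
        {"text": "gobierno", "lemma": "gobierno", "pos": "NOUN", "dep": "nsubj"},
        {"text": "anunció", "lemma": "anunciar", "pos": "VERB", "dep": "ROOT"},
        {"text": "medidas", "lemma": "medida", "pos": "NOUN", "dep": "obj"},
    ]}]}
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("txt-files/demo.txt", "El gobierno anunció medidas")
        zf.writestr("json-annotated/demo.json", json.dumps(doc, ensure_ascii=False))
    return buffer.getvalue()


# Run the counter on the stand-in archive.
with zipfile.ZipFile(io.BytesIO(build_demo_archive())) as demo:
    DEMO_COUNTS = pos_counts(demo)
print(DEMO_COUNTS)
```

With access to the real data, the same function could be applied per country by opening each downloaded archive with zipfile.ZipFile(path), after adjusting the assumed keys to the actual JSON structure.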
format Digital resource
id zenodo_https___doi_org_10_5281_zenodo_18740955
institution Zenodo
language
publishDate 2026
publisher Zenodo
record_format zenodo
title CO.PRE.PAN Full Corpus (Restricted)
url https://doi.org/10.5281/zenodo.18740955