| Main Author: | Xu, Binbin |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Computation and Language |
| Online Access: | https://arxiv.org/abs/2512.09701 |
| _version_ | 1866917169263017984 |
|---|---|
| author | Xu, Binbin |
| author_facet | Xu, Binbin |
| contents | We present FineFreq, a large-scale multilingual character frequency dataset derived from the FineWeb and FineWeb2 corpora, covering over 1900 languages and spanning 2013-2025. The dataset contains frequency counts for 96 trillion characters processed from 57 TB of compressed text. For each language, FineFreq provides per-character statistics with aggregate and year-level frequencies, allowing fine-grained temporal analysis. The dataset preserves naturally occurring multilingual features such as cross-script borrowings, emoji, and acronyms without applying artificial filtering. Each character entry includes Unicode metadata (category, script, block), enabling domain-specific or other downstream filtering and analysis. The full dataset is released in both CSV and Parquet formats, with associated metadata, available on GitHub and HuggingFace. https://github.com/Bin-2/FineFreq |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2512_09701 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | FineFreq: A Multilingual Character Frequency Dataset from Web-Scale Text |
| topic | Computation and Language |
| url | https://arxiv.org/abs/2512.09701 |
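
The abstract describes per-character statistics with aggregate and year-level frequencies, plus Unicode metadata for each character. The following is a minimal Python sketch of that idea on toy input, not the FineFreq pipeline itself. The function names `char_frequencies` and `describe`, the input layout, and the sample strings are all hypothetical. The standard-library `unicodedata` module exposes the general category and character name but not script or block, so those two fields are omitted here.

```python
from collections import Counter
import unicodedata


def char_frequencies(texts_by_year):
    """Count characters per year and in aggregate.

    texts_by_year: dict mapping a year (int) to an iterable of strings.
    Returns (aggregate Counter, dict mapping year -> Counter).
    This input layout is an assumption made for illustration only.
    """
    per_year = {}
    total = Counter()
    for year, texts in texts_by_year.items():
        counts = Counter()
        for text in texts:
            counts.update(text)  # iterating a str yields code points
        per_year[year] = counts
        total.update(counts)
    return total, per_year


def describe(ch):
    """Attach basic Unicode metadata to a single character."""
    return {
        "char": ch,
        "codepoint": f"U+{ord(ch):04X}",
        "category": unicodedata.category(ch),
        "name": unicodedata.name(ch, ""),  # empty string if unnamed
    }


# Toy sample; no filtering is applied, so emoji and spaces are counted too.
sample = {2013: ["héllo"], 2025: ["hello 😀"]}
total, per_year = char_frequencies(sample)

for ch, n in total.most_common(3):
    print(describe(ch), n)
```

Counting code points directly, with no normalization or filtering, mirrors the abstract's statement that emoji and cross-script characters are preserved. Whether the actual dataset applies any Unicode normalization is not stated in this record.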