| Main author: | |
|---|---|
| Format: | Digital resource |
| Language: | |
| Published: | Zenodo, 2025 |
| Subjects: | |
| Online access: | https://doi.org/10.5281/zenodo.16875236 |
Table of contents:
- <p>Database Description</p> <ul> <li>Language: Hong Kong Cantonese, Traditional Chinese</li> <li>Size: ~49.2 GB (SQL dump), 11.1 GB (7z archive)</li> <li>Format: MySQL dump, UTF-8 encoding</li> <li>Source: public web sources (news sites, online forums, encyclopedia, and restaurant reviews)</li> </ul> <p>⚠ This dataset provides a MySQL dump file containing a large-scale raw text corpus collected from various Hong Kong public web sources, primarily focused on Hong Kong Cantonese and Traditional Chinese language usage.</p> <p>It was used to generate the Hong Kong Content Corpus, which was then used in the experiments reported in https://doi.org/10.1145/3744341 to study the effect of diglossia on Hong Kong language modeling.</p> <p>This MySQL database is intended for archival and reproducibility purposes. It may include noise, duplication, HTML markup, crawler residue, and records that were subsequently cleaned or filtered out in the derived corpus release.</p> <p>This dataset is also available on Hugging Face as an unsplit archive: https://huggingface.co/datasets/SolarisCipher/hk_content_corpus_mysql</p> <p>If you are looking for the cleaned, ready-to-use version of the corpus, please refer to:<br>https://doi.org/10.5281/zenodo.16882351</p> <p>NOTE: The Hong Kong National Security Law (HKNSL) took effect on 2020-06-30, which may bias user content created afterwards. That portion of the data should be used with caution.<br><br>SHA256 checksums of the files:<br>0c279f564d4fb02fe7b05c7d424d8e0497e7c26d9caeb3fd6c31d2561b6c4d83 hk_content.7z.001<br>140c5f335799cc783d1ccadfce68f19d5efc6dba1794255c29445cec30bebfcb hk_content.7z.002<br>efa6912d3792a21833808339725f17428341217d50d983ccf426d205c6104a38 hk_content.7z.003<br>b3b7a600ec2e2b5c6ce9ebc1e545712e696c6f6f94b78d0473486609eb7fb854 [SQL file after decompression]<br><br>If you use this database, please cite the following paper, and optionally cite the database DOI:<br>@article{Yung2025HKDiglossia,<br> author = {Yung, Yiu Cheong and Lin, Ying-Jia and Kao, Hung-Yu},<br> title = {Exploring the Effectiveness of Pre-training Language Models with Incorporation of Diglossia for Hong Kong Content},<br> journal = {ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP)},<br> volume = {24},<br> number = {7},<br> pages = {71:1--71:16},<br> year = {2025},<br> publisher = {Association for Computing Machinery},<br> doi = {10.1145/3744341}<br>}</p>
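The SHA256 checksums listed in the record can be checked before extracting the archive. The following is a minimal sketch, assuming the three archive parts were downloaded under the file names shown in the record; the helper names `sha256_of` and `verify` are illustrative and not part of the dataset. The record does not give the file name of the decompressed SQL dump, so its checksum is kept as a separate constant to be compared manually.

```python
import hashlib
from pathlib import Path

# Checksums copied from the record for the three archive parts.
EXPECTED = {
    "hk_content.7z.001": "0c279f564d4fb02fe7b05c7d424d8e0497e7c26d9caeb3fd6c31d2561b6c4d83",
    "hk_content.7z.002": "140c5f335799cc783d1ccadfce68f19d5efc6dba1794255c29445cec30bebfcb",
    "hk_content.7z.003": "efa6912d3792a21833808339725f17428341217d50d983ccf426d205c6104a38",
}

# Checksum of the SQL file after decompression (its file name is not stated in the record).
SQL_DUMP_SHA256 = "b3b7a600ec2e2b5c6ce9ebc1e545712e696c6f6f94b78d0473486609eb7fb854"


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory, expected=EXPECTED):
    """Map each expected file name to True if it exists and its digest matches."""
    results = {}
    for name, want in expected.items():
        path = Path(directory) / name
        results[name] = path.is_file() and sha256_of(path) == want
    return results


if __name__ == "__main__":
    # Assumes the archive parts are in the current working directory.
    for name, ok in verify(".").items():
        print(f"{name}: {'OK' if ok else 'MISSING OR MISMATCH'}")
```

After extraction, the same helper can be applied to the SQL file, for example `sha256_of(path_to_sql) == SQL_DUMP_SHA256`, where `path_to_sql` is whatever name the extracted dump has on disk.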