Bibliographic Details
Main Author: Yung, Yiu Cheong
Format: Digital resource
Language:
Published: Zenodo 2025
Subjects:
Online Access: https://doi.org/10.5281/zenodo.16875236
Table of Contents:
  • <p>Database Description</p> <ul> <li>Language: Hong Kong Cantonese, Traditional Chinese</li> <li>Size: ~49.2 GB (SQL dump), 11.1 GB (7z archive)</li> <li>Format: MySQL dump, UTF-8 encoding</li> <li>Source: public web sources (news sites, online forums, encyclopedia, and restaurant reviews)</li> </ul> <p>⚠ This dataset provides a MySQL dump file containing a large-scale raw text corpus collected from various public Hong Kong web sources, focused primarily on Hong Kong Cantonese and Traditional Chinese language usage.</p> <p>It was used to generate the Hong Kong Content Corpus, which was then used in the experiments reported in https://doi.org/10.1145/3744341 to study the effect of diglossia on Hong Kong language modeling.</p> <p>This MySQL database is intended for archival and reproducibility purposes. It may include noise, duplication, HTML markup, crawler residue, and records that were subsequently cleaned or filtered out in the derived corpus release.</p> <p>This dataset is also available on Hugging Face as an unsplit archive: https://huggingface.co/datasets/SolarisCipher/hk_content_corpus_mysql</p> <p>If you are looking for the cleaned, ready-to-use version of the corpus, please refer to: <br>https://doi.org/10.5281/zenodo.16882351</p> <p>NOTE: The Hong Kong National Security Law (HKNSL) took effect on 2020-06-30, which may bias user content created afterwards. 
That portion of the data should be used with caution.<br><br>SHA256 checksums of the files:<br>0c279f564d4fb02fe7b05c7d424d8e0497e7c26d9caeb3fd6c31d2561b6c4d83 hk_content.7z.001<br>140c5f335799cc783d1ccadfce68f19d5efc6dba1794255c29445cec30bebfcb hk_content.7z.002<br>efa6912d3792a21833808339725f17428341217d50d983ccf426d205c6104a38 hk_content.7z.003<br>b3b7a600ec2e2b5c6ce9ebc1e545712e696c6f6f94b78d0473486609eb7fb854  [SQL file after decompression]<br><br>If you use this database, please cite the following paper, and optionally cite the database DOI:<br>@article{Yung2025HKDiglossia,<br>  author    = {Yung, Yiu Cheong and Lin, Ying-Jia and Kao, Hung-Yu},<br>  title     = {Exploring the Effectiveness of Pre-training Language Models with Incorporation of Diglossia for Hong Kong Content},<br>  journal   = {ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP)},<br>  volume    = {24},<br>  number    = {7},<br>  pages     = {71:1--71:16},<br>  year      = {2025},<br>  publisher = {Association for Computing Machinery},<br>  doi       = {10.1145/3744341}<br>}</p>
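The SHA256 checksums listed in this record can be checked before the archive parts are reassembled. The following is a minimal sketch, not part of the dataset itself: the function names `sha256_of` and `verify` are hypothetical, and it assumes the three archive parts sit in one local directory under the file names given in the record. The record does not state the name of the decompressed SQL file, so its digest is kept as a separate constant to be paired with whatever path the extraction produces.

```python
import hashlib
from pathlib import Path

# Expected SHA256 digests of the archive parts, copied from the record.
EXPECTED = {
    "hk_content.7z.001": "0c279f564d4fb02fe7b05c7d424d8e0497e7c26d9caeb3fd6c31d2561b6c4d83",
    "hk_content.7z.002": "140c5f335799cc783d1ccadfce68f19d5efc6dba1794255c29445cec30bebfcb",
    "hk_content.7z.003": "efa6912d3792a21833808339725f17428341217d50d983ccf426d205c6104a38",
}

# Digest of the SQL file after decompression. The record does not name
# that file, so no file name is assumed here.
SQL_DUMP_SHA256 = "b3b7a600ec2e2b5c6ce9ebc1e545712e696c6f6f94b78d0473486609eb7fb854"


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so multi-gigabyte files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory, expected=EXPECTED):
    """Map each expected file name to True if it exists and its digest matches."""
    results = {}
    for name, want in expected.items():
        path = Path(directory) / name
        results[name] = path.is_file() and sha256_of(path) == want.lower()
    return results


if __name__ == "__main__":
    # Assumption: the archive parts were downloaded to the current directory.
    for name, ok in verify(".").items():
        print(f"{name}: {'OK' if ok else 'MISSING OR MISMATCH'}")
```

After extraction, the same helper can be applied to the SQL file by comparing `sha256_of(<path to the extracted file>)` against `SQL_DUMP_SHA256`. Reading in chunks is a deliberate choice here, since the dump is roughly 49.2 GB.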