Saved in:
| Main Authors: | , |
|---|---|
| Format: | Preprint |
| Published: |
2026
|
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2604.01745 |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
| _version_ | 1866908933352849408 |
|---|---|
| author | Berbatova, Melania Vasev, Tsvetoslav |
| author_facet | Berbatova, Melania Vasev, Tsvetoslav |
| contents | Toxic content detection in online communication remains a significant challenge, with current solutions often inadvertently blocking valuable information, including medical terms and text related to minority groups. This paper presents a more nu-anced approach to identifying toxicity in Bulgarian text while preserving access to essential information. The research explores two distinct methodologies for detecting toxic content. The developed methodologies have po-tential applications across diverse online platforms and content moderation systems. First, we propose an ontology that models the potentially toxic words in Bulgarian language. Then, we compose a dataset that comprises 4,384 manually anno-tated sentences from Bulgarian online forums across four categories: toxic language, medical terminology, non-toxic lan-guage, and terms related to minority communities. We then train a BERT-based model for toxic language classification, which reaches a 0.89 F1 macro score. The trained model is directly applicable in a real environment and can be integrated as a com-ponent of toxic content detection systems. |
| format | Preprint |
| id |
arxiv_https___arxiv_org_abs_2604_01745 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| spellingShingle | Detecting Toxic Language: Ontology and BERT-based Approaches for Bulgarian Text Berbatova, Melania Vasev, Tsvetoslav Computation and Language Toxic content detection in online communication remains a significant challenge, with current solutions often inadvertently blocking valuable information, including medical terms and text related to minority groups. This paper presents a more nu-anced approach to identifying toxicity in Bulgarian text while preserving access to essential information. The research explores two distinct methodologies for detecting toxic content. The developed methodologies have po-tential applications across diverse online platforms and content moderation systems. First, we propose an ontology that models the potentially toxic words in Bulgarian language. Then, we compose a dataset that comprises 4,384 manually anno-tated sentences from Bulgarian online forums across four categories: toxic language, medical terminology, non-toxic lan-guage, and terms related to minority communities. We then train a BERT-based model for toxic language classification, which reaches a 0.89 F1 macro score. The trained model is directly applicable in a real environment and can be integrated as a com-ponent of toxic content detection systems. |
| title | Detecting Toxic Language: Ontology and BERT-based Approaches for Bulgarian Text |
| topic | Computation and Language |
| url | https://arxiv.org/abs/2604.01745 |