Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Author:	Masoka, Happymore
Format:	Preprint
Published:	2025
Subjects:	Computation and Language Artificial Intelligence
Online Access:	https://arxiv.org/abs/2509.14249
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866912591320711168
author	Masoka, Happymore
author_facet	Masoka, Happymore
contents	African languages remain underrepresented in natural language processing (NLP), with most corpora limited to formal registers that fail to capture the vibrancy of everyday communication. This work addresses this gap for Shona, a Bantu language spoken in Zimbabwe and Zambia, by introducing a novel Shona--English slang dataset curated from anonymized social media conversations. The dataset is annotated for intent, sentiment, dialogue acts, code-mixing, and tone, and is publicly available at https://github.com/HappymoreMasoka/Working_with_shona-slang. We fine-tuned a multilingual DistilBERT classifier for intent recognition, achieving 96.4\% accuracy and 96.3\% F1-score, hosted at https://huggingface.co/HappymoreMasoka. This classifier is integrated into a hybrid chatbot that combines rule-based responses with retrieval-augmented generation (RAG) to handle domain-specific queries, demonstrated through a use case assisting prospective students with graduate program information at Pace University. Qualitative evaluation shows the hybrid system outperforms a RAG-only baseline in cultural relevance and user engagement. By releasing the dataset, model, and methodology, this work advances NLP resources for African languages, promoting inclusive and culturally resonant conversational AI.
format	Preprint
id	arxiv_https___arxiv_org_abs_2509_14249
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	Advancing Conversational AI with Shona Slang: A Dataset and Hybrid Model for Digital Inclusion Masoka, Happymore Computation and Language Artificial Intelligence African languages remain underrepresented in natural language processing (NLP), with most corpora limited to formal registers that fail to capture the vibrancy of everyday communication. This work addresses this gap for Shona, a Bantu language spoken in Zimbabwe and Zambia, by introducing a novel Shona--English slang dataset curated from anonymized social media conversations. The dataset is annotated for intent, sentiment, dialogue acts, code-mixing, and tone, and is publicly available at https://github.com/HappymoreMasoka/Working_with_shona-slang. We fine-tuned a multilingual DistilBERT classifier for intent recognition, achieving 96.4\% accuracy and 96.3\% F1-score, hosted at https://huggingface.co/HappymoreMasoka. This classifier is integrated into a hybrid chatbot that combines rule-based responses with retrieval-augmented generation (RAG) to handle domain-specific queries, demonstrated through a use case assisting prospective students with graduate program information at Pace University. Qualitative evaluation shows the hybrid system outperforms a RAG-only baseline in cultural relevance and user engagement. By releasing the dataset, model, and methodology, this work advances NLP resources for African languages, promoting inclusive and culturally resonant conversational AI.
title	Advancing Conversational AI with Shona Slang: A Dataset and Hybrid Model for Digital Inclusion
topic	Computation and Language Artificial Intelligence
url	https://arxiv.org/abs/2509.14249

Similar Items