Bibliographic Details
Main Authors: Marivate, Vukosi; Olaleye, Kayode; Mundia, Sitwala; Bakainga, Andinda; Netshifhefhe, Unarine; Milanzie, Mahmooda; Mogale, Tsholofelo Hope; Sindane, Thapelo; Abdulrasaq, Zainab; Mokgosi, Kesego; Okorie, Chijioke; Van Wyk, Nia Zion; Morrissey, Graham; Dunbar, Dale; Smit, Francois; Chidi, Tsosheletso; Mabuya, Rooweither; Bukula, Andiswa; Mlambo, Respect; Macucwa, Tebogo; Abdulmumin, Idris; and Rananga, Seani
Format: Preprint
Published: 2025
Online Access: https://arxiv.org/abs/2512.02201
Abstract:
  • This paper introduces Swivuriso, a 3000-hour multilingual speech dataset developed as part of the African Next Voices project to support the development and benchmarking of automatic speech recognition (ASR) technologies in seven South African languages. Covering agriculture, healthcare, and general-domain topics, Swivuriso addresses significant gaps in existing ASR datasets. We describe the design principles, ethical considerations, and data collection procedures that guided the dataset's creation. We present baseline results from training/fine-tuning ASR models on this data and compare them against other ASR datasets for the languages concerned.