Bibliographic Details
Main Authors: Marivate, Vukosi; Olaleye, Kayode; Mundia, Sitwala; Bakainga, Andinda; Netshifhefhe, Unarine; Milanzie, Mahmooda; Mogale, Tsholofelo Hope; Sindane, Thapelo; Abdulrasaq, Zainab; Mokgosi, Kesego; Okorie, Chijioke; Van Wyk, Nia Zion; Morrissey, Graham; Dunbar, Dale; Smit, Francois; Chidi, Tsosheletso; Mabuya, Rooweither; Bukula, Andiswa; Mlambo, Respect; Macucwa, Tebogo; Abdulmumin, Idris; Rananga, Seani
Format: Preprint
Published: 2025
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2512.02201
_version_ 1866914262530654208
author Marivate, Vukosi
Olaleye, Kayode
Mundia, Sitwala
Bakainga, Andinda
Netshifhefhe, Unarine
Milanzie, Mahmooda
Mogale, Tsholofelo Hope
Sindane, Thapelo
Abdulrasaq, Zainab
Mokgosi, Kesego
Okorie, Chijioke
Van Wyk, Nia Zion
Morrissey, Graham
Dunbar, Dale
Smit, Francois
Chidi, Tsosheletso
Mabuya, Rooweither
Bukula, Andiswa
Mlambo, Respect
Macucwa, Tebogo
Abdulmumin, Idris
Rananga, Seani
contents This paper introduces Swivuriso, a 3000-hour multilingual speech dataset developed as part of the African Next Voices project, to support the development and benchmarking of automatic speech recognition (ASR) technologies in seven South African languages. Covering agriculture, healthcare, and general domain topics, Swivuriso addresses significant gaps in existing ASR datasets. We describe the design principles, ethical considerations, and data collection procedures that guided the dataset creation. We present baseline results of training/finetuning ASR models with this data and compare to other ASR datasets for the languages concerned.
format Preprint
id arxiv_https___arxiv_org_abs_2512_02201
institution arXiv
publishDate 2025
record_format arxiv
title Swivuriso: The South African Next Voices Multilingual Speech Dataset
topic Computation and Language
url https://arxiv.org/abs/2512.02201