Staff View: Swivuriso: The South African Next Voices Multilingual Speech Dataset :: Library Catalog

Bibliographic Details
Main Authors:	Marivate, Vukosi; Olaleye, Kayode; Mundia, Sitwala; Bakainga, Andinda; Netshifhefhe, Unarine; Milanzie, Mahmooda; Mogale, Tsholofelo Hope; Sindane, Thapelo; Abdulrasaq, Zainab; Mokgosi, Kesego; Okorie, Chijioke; Van Wyk, Nia Zion; Morrissey, Graham; Dunbar, Dale; Smit, Francois; Chidi, Tsosheletso; Mabuya, Rooweither; Bukula, Andiswa; Mlambo, Respect; Macucwa, Tebogo; Abdulmumin, Idris; Rananga, Seani
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2512.02201

_version_	1866914262530654208
author	Marivate, Vukosi; Olaleye, Kayode; Mundia, Sitwala; Bakainga, Andinda; Netshifhefhe, Unarine; Milanzie, Mahmooda; Mogale, Tsholofelo Hope; Sindane, Thapelo; Abdulrasaq, Zainab; Mokgosi, Kesego; Okorie, Chijioke; Van Wyk, Nia Zion; Morrissey, Graham; Dunbar, Dale; Smit, Francois; Chidi, Tsosheletso; Mabuya, Rooweither; Bukula, Andiswa; Mlambo, Respect; Macucwa, Tebogo; Abdulmumin, Idris; Rananga, Seani
author_facet	Marivate, Vukosi; Olaleye, Kayode; Mundia, Sitwala; Bakainga, Andinda; Netshifhefhe, Unarine; Milanzie, Mahmooda; Mogale, Tsholofelo Hope; Sindane, Thapelo; Abdulrasaq, Zainab; Mokgosi, Kesego; Okorie, Chijioke; Van Wyk, Nia Zion; Morrissey, Graham; Dunbar, Dale; Smit, Francois; Chidi, Tsosheletso; Mabuya, Rooweither; Bukula, Andiswa; Mlambo, Respect; Macucwa, Tebogo; Abdulmumin, Idris; Rananga, Seani
contents	This paper introduces Swivuriso, a 3000-hour multilingual speech dataset developed as part of the African Next Voices project, to support the development and benchmarking of automatic speech recognition (ASR) technologies in seven South African languages. Covering agriculture, healthcare, and general domain topics, Swivuriso addresses significant gaps in existing ASR datasets. We describe the design principles, ethical considerations, and data collection procedures that guided the dataset creation. We present baseline results of training/finetuning ASR models with this data and compare to other ASR datasets for the languages concerned.
format	Preprint
id	arxiv_https___arxiv_org_abs_2512_02201
institution	arXiv
publishDate	2025
record_format	arxiv
title	Swivuriso: The South African Next Voices Multilingual Speech Dataset
topic	Computation and Language
url	https://arxiv.org/abs/2512.02201

Similar Items