Staff View: Granary: Speech Recognition and Translation Dataset in 25 European Languages :: Library Catalog

Bibliographic Details
Main Authors:	Koluguri, Nithin Rao; Sekoyan, Monica; Zelenfroynd, George; Meister, Sasha; Ding, Shuoyang; Kostandian, Sofia; Huang, He; Karpov, Nikolay; Balam, Jagadeesh; Lavrukhin, Vitaly; Peng, Yifan; Papi, Sara; Gaido, Marco; Brutti, Alessio; Ginsburg, Boris
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2505.13404

_version_	1866916749047234560
author	Koluguri, Nithin Rao; Sekoyan, Monica; Zelenfroynd, George; Meister, Sasha; Ding, Shuoyang; Kostandian, Sofia; Huang, He; Karpov, Nikolay; Balam, Jagadeesh; Lavrukhin, Vitaly; Peng, Yifan; Papi, Sara; Gaido, Marco; Brutti, Alessio; Ginsburg, Boris
author_facet	Koluguri, Nithin Rao; Sekoyan, Monica; Zelenfroynd, George; Meister, Sasha; Ding, Shuoyang; Kostandian, Sofia; Huang, He; Karpov, Nikolay; Balam, Jagadeesh; Lavrukhin, Vitaly; Peng, Yifan; Papi, Sara; Gaido, Marco; Brutti, Alessio; Ginsburg, Boris
contents	Multi-task and multilingual approaches benefit large models, yet speech processing for low-resource languages remains underexplored due to data scarcity. To address this, we present Granary, a large-scale collection of speech datasets for recognition and translation across 25 European languages. This is the first open-source effort at this scale for both transcription and translation. We enhance data quality using a pseudo-labeling pipeline with segmentation, two-pass inference, hallucination filtering, and punctuation restoration. We further generate translation pairs from pseudo-labeled transcriptions using EuroLLM, followed by a data filtration pipeline. Designed for efficiency, our pipeline processes vast amounts of data within hours. We assess models trained on processed data by comparing their performance on previously curated datasets for both high- and low-resource languages. Our findings show that these models achieve similar performance using approx. 50% less data. The dataset will be made available at https://hf.co/datasets/nvidia/Granary
format	Preprint
id	arxiv_https___arxiv_org_abs_2505_13404
institution	arXiv
publishDate	2025
record_format	arxiv
title	Granary: Speech Recognition and Translation Dataset in 25 European Languages
topic	Computation and Language; Audio and Speech Processing
url	https://arxiv.org/abs/2505.13404

Similar Items