Bibliographic Details
Main Authors: Koluguri, Nithin Rao, Sekoyan, Monica, Zelenfroynd, George, Meister, Sasha, Ding, Shuoyang, Kostandian, Sofia, Huang, He, Karpov, Nikolay, Balam, Jagadeesh, Lavrukhin, Vitaly, Peng, Yifan, Papi, Sara, Gaido, Marco, Brutti, Alessio, Ginsburg, Boris
Format: Preprint
Published: 2025
Subjects: Computation and Language; Audio and Speech Processing
Online Access: https://arxiv.org/abs/2505.13404
author Koluguri, Nithin Rao
Sekoyan, Monica
Zelenfroynd, George
Meister, Sasha
Ding, Shuoyang
Kostandian, Sofia
Huang, He
Karpov, Nikolay
Balam, Jagadeesh
Lavrukhin, Vitaly
Peng, Yifan
Papi, Sara
Gaido, Marco
Brutti, Alessio
Ginsburg, Boris
contents Multi-task and multilingual approaches benefit large models, yet speech processing for low-resource languages remains underexplored due to data scarcity. To address this, we present Granary, a large-scale collection of speech datasets for recognition and translation across 25 European languages. This is the first open-source effort at this scale for both transcription and translation. We enhance data quality using a pseudo-labeling pipeline with segmentation, two-pass inference, hallucination filtering, and punctuation restoration. We further generate translation pairs from pseudo-labeled transcriptions using EuroLLM, followed by a data filtration pipeline. Designed for efficiency, our pipeline processes vast amounts of data within hours. We assess models trained on the processed data by comparing their performance with that of models trained on previously curated datasets, for both high- and low-resource languages. Our findings show that these models achieve similar performance using approximately 50% less data. The dataset will be made available at https://hf.co/datasets/nvidia/Granary
format Preprint
id arxiv_https___arxiv_org_abs_2505_13404
institution arXiv
publishDate 2025
record_format arxiv
title Granary: Speech Recognition and Translation Dataset in 25 European Languages
topic Computation and Language
Audio and Speech Processing
url https://arxiv.org/abs/2505.13404