| Main Authors: | Rahman, Hanif, Rehman, Shafeeq ur |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2603.27021 |
Similar Items
PashtoCorp: A 1.25-Billion-Word Corpus, Evaluation Suite, and Reproducible Pipeline for Low-Resource Language Development
by: Rahman, Hanif
Published: (2026)
Benchmarking Multilingual Speech Models on Pashto: Zero-Shot ASR, Script Failure, and Cross-Domain Evaluation
by: Rahman, Hanif
Published: (2026)
Fine-tuning Whisper for Pashto ASR: strategies and scale
by: Rahman, Hanif
Published: (2026)
PashtoTTS-Bench: automated screening for low-resource non-Latin-script text-to-speech
by: Rahman, Hanif
Published: (2026)
From Scarcity to Scale: A Release-Level Analysis of the Pashto Common Voice Dataset
by: Jahani, Jandad, et al.
Published: (2026)
Connecting Voices: LoReSpeech as a Low-Resource Speech Parallel Corpus
by: Ouzerrout, Samy
Published: (2025)
Saar-Voice: A Multi-Speaker Saarbrücken Dialect Speech Corpus
by: Oberkircher, Lena S., et al.
Published: (2026)
Giving Voice to the Constitution: Low-Resource Text-to-Speech for Quechua and Spanish Using a Bilingual Legal Corpus
by: Ortega, John E., et al.
Published: (2026)
Towards Open Foundation Language Model and Corpus for Macedonian: A Low-Resource Language
by: Krsteski, Stefan, et al.
Published: (2025)
SloPal: A 60-Million-Word Slovak Parliamentary Corpus with Aligned Speech and Fine-Tuned ASR Models
by: Božík, Erik, et al.
Published: (2025)
Developing an Open Conversational Speech Corpus for the Isan Language
by: Na-Thalang, Adisai, et al.
Published: (2025)
Building Efficient and Effective OpenQA Systems for Low-Resource Languages
by: Budur, Emrah, et al.
Published: (2024)
SITA: Learning Speaker-Invariant and Tone-Aware Speech Representations for Low-Resource Tonal Languages
by: Xu, Tianyi, et al.
Published: (2026)
OpenWHO: A Document-Level Parallel Corpus for Health Translation in Low-Resource Languages
by: Merx, Raphaël, et al.
Published: (2025)
Quechua Speech Datasets in Common Voice: The Case of Puno Quechua
by: Huaman, Elwin, et al.
Published: (2025)
GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages
by: Kargaran, Amir Hossein, et al.
Published: (2024)
Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages
by: Li, Chin-Jou, et al.
Published: (2025)
CorpusQA: A 10 Million Token Benchmark for Corpus-Level Analysis and Reasoning
by: Lu, Zhiyuan, et al.
Published: (2026)
Speech-to-Speech Translation Pipelines for Conversations in Low-Resource Languages
by: Popescu-Belis, Andrei, et al.
Published: (2025)
Analysis of Speech Temporal Dynamics in the Context of Speaker Verification and Voice Anonymization
by: Tomashenko, Natalia, et al.
Published: (2024)
Low-Resource Safety Failures Are Action Failures, Not Representation Failures
by: Aziz, Rashad, et al.
Published: (2026)
MaiBERT: A Pre-training Corpus and Language Model for Low-Resourced Maithili Language
by: Yadav, Sumit, et al.
Published: (2025)
ArVoice: A Multi-Speaker Dataset for Arabic Speech Synthesis
by: Toyin, Hawau Olamide, et al.
Published: (2025)
Enhancing Voice Wake-Up for Dysarthria: Mandarin Dysarthria Speech Corpus Release and Customized System Design
by: Gao, Ming, et al.
Published: (2024)
FalAR: A Large-scale Speaker-Annotated European Portuguese Speech Corpus of Parliamentary Sessions
by: Teixeira, Francisco, et al.
Published: (2026)
DATASHI: A Parallel English-Tashlhiyt Corpus for Orthography Normalization and Low-Resource Language Processing
by: Monir, Nasser-Eddine, et al.
Published: (2026)
Overcoming Low-Resource Barriers in Tulu: Neural Models and Corpus Creation for Offensive Language Identification
by: D, Anusha M, et al.
Published: (2025)
Assessing the Feasibility of Lightweight Whisper Models for Low-Resource Urdu Transcription
by: Antall, Abdul Rehman, et al.
Published: (2025)
Speak & Improve Corpus 2025: an L2 English Speech Corpus for Language Assessment and Feedback
by: Knill, Kate, et al.
Published: (2024)
Commonality and Individuality! Integrating Humor Commonality with Speaker Individuality for Humor Recognition
by: Zhu, Haohao, et al.
Published: (2025)
Building a Non-native Speech Corpus Featuring Chinese-English Bilingual Children: Compilation and Rationale
by: Hung, Hiuchung, et al.
Published: (2023)
CV-18 NER: Augmented Common Voice for Named Entity Recognition from Arabic Speech
by: Saidi, Youssef, et al.
Published: (2026)
FeruzaSpeech: A 60 Hour Uzbek Read Speech Corpus with Punctuation, Casing, and Context
by: Povey, Anna, et al.
Published: (2024)
The Thiomi Dataset: A Large-Scale Multimodal Corpus for Low-Resource African Languages
by: Mutisya, Hillary, et al.
Published: (2026)
Quantifying Geospatial in the Common Crawl Corpus
by: Ilyankou, Ilya, et al.
Published: (2024)
GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement
by: Yang, Yifan, et al.
Published: (2024)
Mangosteen: An Open Thai Corpus for Language Model Pretraining
by: Phatthiyaphaibun, Wannaphong, et al.
Published: (2025)
BnTTS: Few-Shot Speaker Adaptation in Low-Resource Setting
by: Basher, Mohammad Jahid Ibna, et al.
Published: (2025)
Building a Large Japanese Web Corpus for Large Language Models
by: Okazaki, Naoaki, et al.
Published: (2024)
Adapting Multilingual LLMs to Low-Resource Languages using Continued Pre-training and Synthetic Corpus
by: Joshi, Raviraj, et al.
Published: (2024)