Bibliographic Details
Main Authors: Suarez, Pedro Ortiz, Burchell, Laurie, Arnett, Catherine, Mosquera-Gómez, Rafael, Hincapie-Monsalve, Sara, Vaughan, Thom, Stewart, Damian, Ostendorff, Malte, Abdulmumin, Idris, Marivate, Vukosi, Muhammad, Shamsuddeen Hassan, Tonja, Atnafu Lambebo, Al-Khalifa, Hend, Hammouda, Nadia Ghezaiel, Otiende, Verrah, Wong, Tack Hwa, Saydaliev, Jakhongir, Nobakhtian, Melika, Habibi, Muhammad Ravi Shulthan, Kranti, Chalamalasetti, Muchemi, Carol, Nguyen, Khang, Adam, Faisal Muhammad, Salim, Luis Frentzen, Alqifari, Reem, Amol, Cynthia, Imperial, Joseph Marvin, Kesen, Ilker, Mustafid, Ahmad, Stepachev, Pavel, Choshen, Leshem, Anugraha, David, Nayel, Hamada, Yimam, Seid Muhie, Putra, Vallerie Alexandra, Nguyen, My Chiffon, Wasi, Azmine Toushik, Vadithya, Gouthami, van der Goot, Rob, C'horr, Lanwenn ar, Dua, Karan, Yates, Andrew, Bangera, Mithil, Bangera, Yeshil, Patel, Hitesh Laxmichand, Okabe, Shu, Ilasariya, Fenal Ashokbhai, Gaynullin, Dmitry, Winata, Genta Indra, Li, Yiyuan, Martínez, Juan Pablo, Agarwal, Amit, Hanif, Ikhlasul Akmal, Ahmad, Raia Abu, Adenuga, Esther, Tjiaranata, Filbert Aurelian, Buaphet, Weerayut, Anugraha, Michael, Vajjala, Sowmya, Rice, Benjamin, Amirudin, Azril Hafizi, Alabi, Jesujoba O., Panda, Srikant, Toughrai, Yassine, Kyomuhendo, Bruhan, Ruffinelli, Daniel, A, Akshata, Goulão, Manuel, Zhou, Ej, Ramirez, Ingrid Gabriela Franco, Aggazzotti, Cristina, Dobler, Konstantin, Kevin, Jun, Pagès, Quentin, Andrews, Nicholas, Ibrahim, Nuhu, Ruckdeschel, Mattes, Keleg, Amr, Zhang, Mike, Muziri, Casper, Samuel, Saron, Takeshita, Sotaro, Kerdthaisong, Kun, Foppiano, Luca, Dent, Rasul, Green, Tommaso, Wali, Ahmad Mustapha, Makaaka, Kamohelo, Feliren, Vicky, Idris, Inshirah, Celikkanat, Hande, Abubakar, Abdulhamid, Maillard, Jean, Sagot, Benoît, Clérice, Thibault, Murray, Kenton, Luger, Sarah
Format: Preprint
Published: 2026
Online Access: https://arxiv.org/abs/2601.18026
Abstract: Language identification (LID) is a fundamental step in curating multilingual corpora. However, LID models still perform poorly for many languages, especially on the noisy and heterogeneous web data often used to train multilingual language models. In this paper, we introduce CommonLID, a community-driven, human-annotated LID benchmark for the web domain, covering 109 languages. Many of the included languages have been previously under-served, making CommonLID a key resource for developing more representative high-quality text corpora. We show CommonLID's value by using it, alongside five other common evaluation sets, to test eight popular LID models. We analyse our results to situate our contribution and to provide an overview of the state of the art. In particular, we highlight that existing evaluations overestimate LID accuracy for many languages in the web domain. We make CommonLID and the code used to create it available under an open, permissive license.