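| Main authors: | Rezaabad, Ali Lotfi; Khanal, Bikram; Chaurasia, Shashwat; Zeng, Lu; Hong, Dezhi; Bashashati, Hossein; Butler, Thomas; Ganji, Megan |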
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Machine Learning |
| Online access: | https://arxiv.org/abs/2512.08143 |
| _version_ | 1866908704201244672 |
|---|---|
| author | Rezaabad, Ali Lotfi; Khanal, Bikram; Chaurasia, Shashwat; Zeng, Lu; Hong, Dezhi; Bashashati, Hossein; Butler, Thomas; Ganji, Megan |
| contents | Language identification is a crucial first step in multilingual systems such as chatbots and virtual assistants, enabling linguistically and culturally accurate user experiences. Errors at this stage can cascade into downstream failures, setting a high bar for accuracy. Yet existing language identification tools struggle with key cases, such as music requests where the song title and the user's language differ. Open-source tools such as LangDetect and FastText are fast but less accurate, while large language models, though effective, are often too costly for low-latency or low-resource settings. We introduce PolyLingua, a lightweight Transformer-based model for in-domain language detection and fine-grained language classification. It employs a two-level contrastive learning framework combining instance-level separation and class-level alignment with adaptive margins, yielding compact and well-separated embeddings even for closely related languages. Evaluated on two challenging datasets -- Amazon Massive (multilingual digital assistant utterances) and a Song dataset (music requests with frequent code-switching) -- PolyLingua achieves 99.25% F1 and 98.15% F1, respectively, surpassing Sonnet 3.5 while using 10x fewer parameters, making it well suited to compute- and latency-constrained environments. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2512_08143 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | PolyLingua: Margin-based Inter-class Transformer for Robust Cross-domain Language Detection |
| topic | Machine Learning |
| url | https://arxiv.org/abs/2512.08143 |