| Main Authors: | Baral, Aditeya; Ajith, Allen George; Nayak, Roshan; Bhanja, Mrityunjay Abhijeet |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Computation and Language; Machine Learning |
| Online Access: | https://arxiv.org/abs/2505.12587 |
| _version_ | 1866913845711208448 |
|---|---|
| author | Baral, Aditeya; Ajith, Allen George; Nayak, Roshan; Bhanja, Mrityunjay Abhijeet |
| author_facet | Baral, Aditeya; Ajith, Allen George; Nayak, Roshan; Bhanja, Mrityunjay Abhijeet |
| contents | Code-mixed languages, characterized by frequent within-sentence language transitions, present structural challenges that standard language models fail to address. In this work, we propose CMLFormer, an enhanced multi-layer dual-decoder Transformer with a shared encoder and synchronized decoder cross-attention, designed to model the linguistic and semantic dynamics of code-mixed text. CMLFormer is pre-trained on an augmented Hinglish corpus with switching point and translation annotations with multiple new objectives specifically aimed at capturing switching behavior, cross-lingual structure, and code-mixing complexity. Our experiments show that CMLFormer improves F1 score, precision, and accuracy over other approaches on the HASOC-2021 benchmark under select pre-training setups. Attention analyses further show that it can identify and attend to switching points, validating its sensitivity to code-mixed structure. These results demonstrate the effectiveness of CMLFormer's architecture and multi-task pre-training strategy for modeling code-mixed languages. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2505_12587 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | CMLFormer: A Dual Decoder Transformer with Switching Point Learning for Code-Mixed Language Modeling |
| topic | Computation and Language; Machine Learning |
| url | https://arxiv.org/abs/2505.12587 |