Bibliographic Details
Main Authors: Baral, Aditeya; Ajith, Allen George; Nayak, Roshan; Bhanja, Mrityunjay Abhijeet
Format: Preprint
Published: 2025
Subjects: Computation and Language; Machine Learning
Online Access: https://arxiv.org/abs/2505.12587
author Baral, Aditeya
Ajith, Allen George
Nayak, Roshan
Bhanja, Mrityunjay Abhijeet
contents Code-mixed languages, characterized by frequent within-sentence language transitions, present structural challenges that standard language models fail to address. In this work, we propose CMLFormer, an enhanced multi-layer dual-decoder Transformer with a shared encoder and synchronized decoder cross-attention, designed to model the linguistic and semantic dynamics of code-mixed text. CMLFormer is pre-trained on an augmented Hinglish corpus annotated with switching points and translations, using multiple new objectives specifically aimed at capturing switching behavior, cross-lingual structure, and code-mixing complexity. Our experiments show that CMLFormer improves F1 score, precision, and accuracy over other approaches on the HASOC-2021 benchmark under select pre-training setups. Attention analyses further show that it can identify and attend to switching points, validating its sensitivity to code-mixed structure. These results demonstrate the effectiveness of CMLFormer's architecture and multi-task pre-training strategy for modeling code-mixed languages.
format Preprint
id arxiv_https___arxiv_org_abs_2505_12587
institution arXiv
publishDate 2025
record_format arxiv
title CMLFormer: A Dual Decoder Transformer with Switching Point Learning for Code-Mixed Language Modeling
topic Computation and Language
Machine Learning
url https://arxiv.org/abs/2505.12587