Bibliographic Details
Main Authors:	Yadav, Anjali; Garg, Tanya; Klemen, Matej; Ulcar, Matej; Agarwal, Basant; Sikonja, Marko Robnik
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2405.12929

_version_	1866915242218356736
author	Yadav, Anjali; Garg, Tanya; Klemen, Matej; Ulcar, Matej; Agarwal, Basant; Sikonja, Marko Robnik
author_facet	Yadav, Anjali; Garg, Tanya; Klemen, Matej; Ulcar, Matej; Agarwal, Basant; Sikonja, Marko Robnik
contents	Code-mixed discourse combines multiple languages in a single text. It is commonly used in informal discourse in countries with several official languages, but also in many other countries in combination with English or neighboring languages. As large language models have recently come to dominate most natural language processing tasks, we investigated their performance in code-mixed settings for relevant tasks. We first created four new bilingual pre-trained masked language models for English-Hindi and English-Slovene, specifically aimed at supporting informal language. We then evaluated monolingual, bilingual, few-lingual, and massively multilingual models on several languages, using two tasks that frequently contain code-mixed text: sentiment analysis and offensive language detection in social media texts. The results show that the most successful classifiers are fine-tuned bilingual models and multilingual models, specialized for social media texts, followed by non-specialized massively multilingual and monolingual models, while huge generative models are not competitive. For our affective problems, the models mostly perform slightly better on code-mixed data than on non-code-mixed data.
format	Preprint
id	arxiv_https___arxiv_org_abs_2405_12929
institution	arXiv
publishDate	2024
record_format	arxiv
title	Code-mixed Sentiment and Hate-speech Prediction
topic	Computation and Language
url	https://arxiv.org/abs/2405.12929
