Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Zhang, Crystina, Lu, Jing, Tran, Vinh Q., Schuster, Tal, Metzler, Donald, Lin, Jimmy
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2411.04530
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866909912430280704
author	Zhang, Crystina Lu, Jing Tran, Vinh Q. Schuster, Tal Metzler, Donald Lin, Jimmy
author_facet	Zhang, Crystina Lu, Jing Tran, Vinh Q. Schuster, Tal Metzler, Donald Lin, Jimmy
contents	Human understanding of text depends on general semantic concepts of words rather than their superficial forms. To what extent does our human intuition transfer to language models? In this work, we study the degree to which current multilingual language models (mLMs) understand based on subword-level semantic concepts. To this end, we form "semantic tokens" by merging the semantically similar subwords and their embeddings, and evaluate the updated mLMs on five heterogeneous multilingual downstream tasks. Results show that the general shared semantics could get the models a long way in making the predictions on mLMs with different tokenizers and model sizes. Inspections of the grouped subwords show that they exhibit a wide range of semantic similarities, including synonyms and translations across many languages and scripts. Lastly, we find that the zero-shot results with semantic tokens are on par with or even better than the original models on certain classification tasks, suggesting that the shared subword-level semantics may serve as the anchors for cross-lingual transfer.
format	Preprint
id	arxiv_https___arxiv_org_abs_2411_04530
institution	arXiv
publishDate	2024
record_format	arxiv
spellingShingle	Tomato, Tomahto, Tomate: Do Multilingual Language Models Understand Based on Subword-Level Semantic Concepts? Zhang, Crystina Lu, Jing Tran, Vinh Q. Schuster, Tal Metzler, Donald Lin, Jimmy Computation and Language Human understanding of text depends on general semantic concepts of words rather than their superficial forms. To what extent does our human intuition transfer to language models? In this work, we study the degree to which current multilingual language models (mLMs) understand based on subword-level semantic concepts. To this end, we form "semantic tokens" by merging the semantically similar subwords and their embeddings, and evaluate the updated mLMs on five heterogeneous multilingual downstream tasks. Results show that the general shared semantics could get the models a long way in making the predictions on mLMs with different tokenizers and model sizes. Inspections of the grouped subwords show that they exhibit a wide range of semantic similarities, including synonyms and translations across many languages and scripts. Lastly, we find that the zero-shot results with semantic tokens are on par with or even better than the original models on certain classification tasks, suggesting that the shared subword-level semantics may serve as the anchors for cross-lingual transfer.
title	Tomato, Tomahto, Tomate: Do Multilingual Language Models Understand Based on Subword-Level Semantic Concepts?
topic	Computation and Language
url	https://arxiv.org/abs/2411.04530

Similar Items