Staff View: Overcoming Vocabulary Constraints with Pixel-level Fallback :: Library Catalog

Bibliographic Details
Main Authors:	Lotz, Jonas F.; Setiawan, Hendra; Peitz, Stephan; Kementchedjhieva, Yova
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2504.02122

_version_	1866911099842985984
author	Lotz, Jonas F.; Setiawan, Hendra; Peitz, Stephan; Kementchedjhieva, Yova
author_facet	Lotz, Jonas F.; Setiawan, Hendra; Peitz, Stephan; Kementchedjhieva, Yova
contents	Subword tokenization requires balancing computational efficiency and vocabulary coverage, which often leads to suboptimal performance on languages and scripts not prioritized during training. We propose to augment pretrained language models with a vocabulary-free encoder that generates input embeddings from text rendered as pixels. Through experiments on English-centric language models, we demonstrate that our approach substantially improves machine translation performance and facilitates effective cross-lingual transfer, outperforming tokenizer-based methods. Furthermore, we find that pixel-based representations outperform byte-level approaches and standard vocabulary expansion. Our approach enhances the multilingual capabilities of monolingual language models without extensive retraining and reduces decoding latency via input compression.
format	Preprint
id	arxiv_https___arxiv_org_abs_2504_02122
institution	arXiv
publishDate	2025
record_format	arxiv
title	Overcoming Vocabulary Constraints with Pixel-level Fallback
topic	Computation and Language
url	https://arxiv.org/abs/2504.02122

Similar Items