Bibliographic Details
Main Authors: Lotz, Jonas F., Setiawan, Hendra, Peitz, Stephan, Kementchedjhieva, Yova
Format: Preprint
Published: 2025
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2504.02122
author Lotz, Jonas F.
Setiawan, Hendra
Peitz, Stephan
Kementchedjhieva, Yova
contents Subword tokenization requires balancing computational efficiency and vocabulary coverage, which often leads to suboptimal performance on languages and scripts not prioritized during training. We propose to augment pretrained language models with a vocabulary-free encoder that generates input embeddings from text rendered as pixels. Through experiments on English-centric language models, we demonstrate that our approach substantially improves machine translation performance and facilitates effective cross-lingual transfer, outperforming tokenizer-based methods. Furthermore, we find that pixel-based representations outperform byte-level approaches and standard vocabulary expansion. Our approach enhances the multilingual capabilities of monolingual language models without extensive retraining and reduces decoding latency via input compression.
format Preprint
id arxiv_https___arxiv_org_abs_2504_02122
institution arXiv
publishDate 2025
record_format arxiv
title Overcoming Vocabulary Constraints with Pixel-level Fallback
topic Computation and Language
url https://arxiv.org/abs/2504.02122