Bibliographic Details
Main Authors: Hu, Chen, Tai, Yintao, Vergari, Antonio, Keller, Frank, Suglia, Alessandro
Format: Preprint
Published: 2026
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2604.11575
_version_ 1866910125477855232
author Hu, Chen
Tai, Yintao
Vergari, Antonio
Keller, Frank
Suglia, Alessandro
contents Pixel-based language models are gaining momentum as alternatives to traditional token-based approaches, promising to circumvent tokenization challenges. However, the inherent perceptual diversity across languages poses a significant hurdle for multilingual generalization in pixel space. This paper introduces MIXAR, the first generative pixel-based language model trained on eight languages that span a range of scripts. We empirically evaluate MIXAR against previous pixel-based models as well as comparable tokenizer-based models, demonstrating substantial performance improvements on discriminative and generative multilingual tasks. We also show that MIXAR is robust to languages never seen during training. These results are further strengthened when the model is scaled to 0.5B parameters, which improves not only its capabilities in generative tasks like LAMBADA but also its robustness to input perturbations such as orthographic attacks.
format Preprint
id arxiv_https___arxiv_org_abs_2604_11575
institution arXiv
publishDate 2026
record_format arxiv
title MIXAR: Scaling Autoregressive Pixel-based Language Models to Multiple Languages and Scripts
topic Computation and Language
url https://arxiv.org/abs/2604.11575