Staff View: PIXAR: Auto-Regressive Language Modeling in Pixel Space :: Library Catalog

Bibliographic Details
Main Authors:	Tai, Yintao; Liao, Xiyang; Suglia, Alessandro; Vergari, Antonio
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2401.03321

_version_	1866911783527120896
author	Tai, Yintao; Liao, Xiyang; Suglia, Alessandro; Vergari, Antonio
author_facet	Tai, Yintao; Liao, Xiyang; Suglia, Alessandro; Vergari, Antonio
contents	Recent work showed the possibility of building open-vocabulary large language models (LLMs) that directly operate on pixel representations. These models are implemented as autoencoders that reconstruct masked patches of rendered text. However, these pixel-based LLMs are limited to discriminative tasks (e.g., classification) and, similar to BERT, cannot be used to generate text. Therefore, they cannot be used for generative tasks such as free-form question answering. In this work, we introduce PIXAR, the first pixel-based autoregressive LLM that performs text generation. Consisting of only a decoder, PIXAR can perform free-form generative tasks while keeping the number of parameters on par with previous encoder-decoder models. Furthermore, we highlight the challenges of generating text as non-noisy images and show this is due to using a maximum likelihood objective. To overcome this problem, we propose an adversarial pretraining stage that improves the readability and accuracy of PIXAR by 8.1 on LAMBADA and 8.5 on bAbI -- making it comparable to GPT-2 on text generation tasks. This paves the way to build open-vocabulary LLMs that operate on perceptual input only and calls into question the necessity of the usual symbolic input representation, i.e., text as (sub)tokens.
format	Preprint
id	arxiv_https___arxiv_org_abs_2401_03321
institution	arXiv
publishDate	2024
record_format	arxiv
title	PIXAR: Auto-Regressive Language Modeling in Pixel Space
topic	Computation and Language
url	https://arxiv.org/abs/2401.03321

Similar Items