Staff View: Hawk: Leveraging Spatial Context for Faster Autoregressive Text-to-Image Generation :: Library Catalog

Bibliographic Details
Main Authors:	Chen, Zhi-Kai; Jiang, Jun-Peng; Ye, Han-Jia; Zhan, De-Chuan
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Machine Learning
Online Access:	https://arxiv.org/abs/2510.25739

_version_	1866912676694720512
author	Chen, Zhi-Kai; Jiang, Jun-Peng; Ye, Han-Jia; Zhan, De-Chuan
author_facet	Chen, Zhi-Kai; Jiang, Jun-Peng; Ye, Han-Jia; Zhan, De-Chuan
contents	Autoregressive (AR) image generation models are capable of producing high-fidelity images but often suffer from slow inference due to their inherently sequential, token-by-token decoding process. Speculative decoding, which employs a lightweight draft model to approximate the output of a larger AR model, has shown promise in accelerating text generation without compromising quality. However, its application to image generation remains largely underexplored. The challenges stem from a significantly larger sampling space, which complicates the alignment between the draft and target model outputs, coupled with the inadequate use of the two-dimensional spatial structure inherent in images, thereby limiting the modeling of local dependencies. To overcome these challenges, we introduce Hawk, a new approach that harnesses the spatial structure of images to guide the speculative model toward more accurate and efficient predictions. Experimental results on multiple text-to-image benchmarks demonstrate a 1.71x speedup over standard AR models, while preserving both image fidelity and diversity.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_25739
institution	arXiv
publishDate	2025
record_format	arxiv
title	Hawk: Leveraging Spatial Context for Faster Autoregressive Text-to-Image Generation
topic	Computer Vision and Pattern Recognition; Machine Learning
url	https://arxiv.org/abs/2510.25739

Similar Items