Bibliographic Details
Main Authors: Xiong, Jing; Liu, Gongye; Huang, Lun; Wu, Chengyue; Wu, Taiqiang; Mu, Yao; Yao, Yuan; Shen, Hui; Wan, Zhongwei; Huang, Jinfa; Tao, Chaofan; Yan, Shen; Yao, Huaxiu; Kong, Lingpeng; Yang, Hongxia; Zhang, Mi; Sapiro, Guillermo; Luo, Jiebo; Luo, Ping; Wong, Ngai
Format: Preprint
Published: 2024
Subjects: Computer Vision and Pattern Recognition; Computation and Language
Online Access: https://arxiv.org/abs/2411.05902
author Xiong, Jing
Liu, Gongye
Huang, Lun
Wu, Chengyue
Wu, Taiqiang
Mu, Yao
Yao, Yuan
Shen, Hui
Wan, Zhongwei
Huang, Jinfa
Tao, Chaofan
Yan, Shen
Yao, Huaxiu
Kong, Lingpeng
Yang, Hongxia
Zhang, Mi
Sapiro, Guillermo
Luo, Jiebo
Luo, Ping
Wong, Ngai
contents Autoregressive modeling has been a huge success in the field of natural language processing (NLP). Recently, autoregressive models have emerged as a significant area of focus in computer vision, where they excel at producing high-quality visual content. Autoregressive models in NLP typically operate on subword tokens. However, the representation strategy in computer vision can vary across levels, i.e., pixel-level, token-level, or scale-level, reflecting the diverse and hierarchical nature of visual data compared to the sequential structure of language. This survey comprehensively examines the literature on autoregressive models applied to vision. To improve readability for researchers from diverse research backgrounds, we start with preliminary sequence representation and modeling in vision. Next, we divide the fundamental frameworks of visual autoregressive models into three general sub-categories (pixel-based, token-based, and scale-based models) according to the representation strategy. We then explore the interconnections between autoregressive models and other generative models. Furthermore, we present a multifaceted categorization of autoregressive models in computer vision, including image generation, video generation, 3D generation, and multimodal generation. We also elaborate on their applications in diverse domains, including emerging domains such as embodied AI and 3D medical AI, with about 250 related references. Finally, we highlight the current challenges facing autoregressive models in vision and suggest potential research directions. We have also set up a GitHub repository to organize the papers included in this survey at: https://github.com/ChaofanTao/Autoregressive-Models-in-Vision-Survey.
format Preprint
id arxiv_https___arxiv_org_abs_2411_05902
institution arXiv
publishDate 2024
record_format arxiv
title Autoregressive Models in Vision: A Survey
topic Computer Vision and Pattern Recognition
Computation and Language
url https://arxiv.org/abs/2411.05902