Staff View: Thicker and Quicker: A Jumbo Token for Fast Plain Vision Transformers :: Library Catalog

Bibliographic Details
Main Authors:	Fuller, Anthony; Yassin, Yousef; Kyrollos, Daniel G.; Shelhamer, Evan; Green, James R.
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2502.15021

_version_	1866917300853014528
author	Fuller, Anthony; Yassin, Yousef; Kyrollos, Daniel G.; Shelhamer, Evan; Green, James R.
author_facet	Fuller, Anthony; Yassin, Yousef; Kyrollos, Daniel G.; Shelhamer, Evan; Green, James R.
contents	ViTs are general and accurate, and address many tasks, but ViTs are slow, and are not always practical when efficiency is key. Existing methods for faster ViTs design hybrid non-ViT architectures, losing generality, or shrink their tokens, sacrificing accuracy. Many non-ViT architectures are both fast and accurate. Yet, without significant modifications, they cannot do what ViTs can: process other input shapes, pre-train by SOTA self-supervised learning, reduce computation by dropping tokens, and more. We make ViTs faster by reducing patch token width while increasing global token width by adding a new Jumbo token. Our wider Jumbo token is processed by its own wider FFN to increase model capacity. Yet our Jumbo FFN is efficient: it processes a single token, for speed, and its parameters are shared across all layers, for memory. Crucially, our Jumbo is attention-only and non-hierarchical, like a plain ViT, so it is simple, scalable, flexible, and compatible with ViT methods new and old. Jumbo improves over ViT baselines with Registers from Nano to Large scales while maintaining speed/throughput on ImageNet-1K (0.1-13%). Jumbo also improves segmentation (1.9-3.1% on ADE20K), MAE pre-training (4.9% linear probing on ImageNet-1K), test-time adaptation (5.2% on ImageNet-C), and time series modeling. Our Jumbo models even achieve better speed-accuracy trade-offs than specialized non-ViT compute-efficient models, while maintaining plain-ViT compatibility for practicality. Code and weights are available: https://github.com/antofuller/jumbo
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_15021
institution	arXiv
publishDate	2025
record_format	arxiv
title	Thicker and Quicker: A Jumbo Token for Fast Plain Vision Transformers
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2502.15021
