Staff View: RAViT: Resolution-Adaptive Vision Transformer :: Library Catalog

Bibliographic Details
Main Authors:	Guidez, Martial; Duffner, Stefan; Garcia, Christophe
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Machine Learning
Online Access:	https://arxiv.org/abs/2602.24159

_version_	1866908856104255488
author	Guidez, Martial; Duffner, Stefan; Garcia, Christophe
author_facet	Guidez, Martial; Duffner, Stefan; Garcia, Christophe
contents	Vision transformers have recently made a breakthrough in computer vision, showing excellent performance in terms of precision for numerous applications. However, their computational cost is very high compared to alternative approaches such as Convolutional Neural Networks. To address this problem, we propose a novel framework for image classification called RAViT, based on a multi-branch network that operates on several copies of the same image at different resolutions to reduce the computational cost while preserving the overall accuracy. Furthermore, our framework includes an early-exit mechanism that makes our model adaptive and allows choosing the appropriate trade-off between accuracy and computational cost at run-time. For example, in a two-branch architecture, the original image is first resized to reduce its resolution, and a first transformer performs a prediction on the resized image. The resulting prediction is then reused together with the original-size image to perform a final prediction with a second transformer, using less computation than a classical Vision Transformer architecture. The early-exit process allows the model to make a final prediction at intermediate branches, saving even more computation. We evaluated our approach on CIFAR-10, Tiny ImageNet, and ImageNet. We obtained accuracy equivalent to that of the classical Vision Transformer model with only around 70% of the FLOPs.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_24159
institution	arXiv
publishDate	2026
record_format	arxiv
title	RAViT: Resolution-Adaptive Vision Transformer
topic	Computer Vision and Pattern Recognition; Machine Learning
url	https://arxiv.org/abs/2602.24159
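
The two-branch, early-exit control flow described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how the exit decision is made or how the first prediction is reused, so the softmax confidence threshold, the average-pooling resize, and the names `downscale`, `two_branch_predict`, `low_branch`, and `high_branch` are all assumptions introduced here for illustration.

```python
import math
from typing import Callable, List, Tuple

Image = List[List[float]]    # toy single-channel image
Logits = List[float]


def softmax(logits: Logits) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def downscale(image: Image, factor: int) -> Image:
    """Average-pool by an integer factor (assumed stand-in for resizing)."""
    h, w = len(image), len(image[0])
    out = []
    for i in range(0, h - h % factor, factor):
        row = []
        for j in range(0, w - w % factor, factor):
            block = [image[i + di][j + dj]
                     for di in range(factor) for dj in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def two_branch_predict(
    image: Image,
    low_branch: Callable[[Image], Logits],           # hypothetical low-res model
    high_branch: Callable[[Image, Logits], Logits],  # hypothetical full-res model
    threshold: float = 0.9,                          # assumed exit criterion
    factor: int = 2,
) -> Tuple[int, int]:
    """Return (predicted class, number of branches executed)."""
    # Branch 1: predict on the reduced-resolution copy of the image.
    low_logits = low_branch(downscale(image, factor))
    probs = softmax(low_logits)
    if max(probs) >= threshold:
        # Early exit: the low-resolution prediction is confident enough.
        return probs.index(max(probs)), 1
    # Branch 2: reuse the first prediction together with the original image.
    final = softmax(high_branch(image, low_logits))
    return final.index(max(final)), 2
```

In this sketch, raising `threshold` sends more images to the second branch (more computation, and in principle higher accuracy), while lowering it lets more images exit after the first branch. That is one simple way to expose a run-time trade-off of the kind the abstract mentions.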

Similar Items