Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Abdullah, Abdullah Nazhat, Aydin, Tarkan
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition Machine Learning
Online Access:	https://arxiv.org/abs/2403.02411
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866912357279596544
author	Abdullah, Abdullah Nazhat Aydin, Tarkan
author_facet	Abdullah, Abdullah Nazhat Aydin, Tarkan
contents	The attention mechanism is the primary component of the transformer architecture; it has led to significant advancements in deep learning spanning many domains and covering multiple tasks. In computer vision, the attention mechanism was first incorporated in the Vision Transformer ViT, and then its usage has expanded into many tasks in the vision domain, such as classification, segmentation, object detection, and image generation. While the attention mechanism is very expressive and capable, it comes with the disadvantage of being computationally expensive and requiring datasets of considerable size for effective optimization. To address these shortcomings, many designs have been proposed in the literature to reduce the computational burden and alleviate the data size requirements. Examples of such attempts in the vision domain are the MLP-Mixer, the Conv-Mixer, the Perciver-IO, and many more attempts with different sets of advantages and disadvantages. This paper introduces a new computational block as an alternative to the standard ViT block. The newly proposed block reduces the computational requirements by replacing the normal attention layers with a Network in Network structure, therefore enhancing the static approach of the MLP-Mixer with a dynamic learning of element-wise gating function generated by a token mixing process. Extensive experimentation shows that the proposed design provides better performance than the baseline architectures on multiple datasets applied in the image classification task of the vision domain.
format	Preprint
id	arxiv_https___arxiv_org_abs_2403_02411
institution	arXiv
publishDate	2024
record_format	arxiv
spellingShingle	NiNformer: A Network in Network Transformer with Token Mixing Generated Gating Function Abdullah, Abdullah Nazhat Aydin, Tarkan Computer Vision and Pattern Recognition Machine Learning The attention mechanism is the primary component of the transformer architecture; it has led to significant advancements in deep learning spanning many domains and covering multiple tasks. In computer vision, the attention mechanism was first incorporated in the Vision Transformer ViT, and then its usage has expanded into many tasks in the vision domain, such as classification, segmentation, object detection, and image generation. While the attention mechanism is very expressive and capable, it comes with the disadvantage of being computationally expensive and requiring datasets of considerable size for effective optimization. To address these shortcomings, many designs have been proposed in the literature to reduce the computational burden and alleviate the data size requirements. Examples of such attempts in the vision domain are the MLP-Mixer, the Conv-Mixer, the Perciver-IO, and many more attempts with different sets of advantages and disadvantages. This paper introduces a new computational block as an alternative to the standard ViT block. The newly proposed block reduces the computational requirements by replacing the normal attention layers with a Network in Network structure, therefore enhancing the static approach of the MLP-Mixer with a dynamic learning of element-wise gating function generated by a token mixing process. Extensive experimentation shows that the proposed design provides better performance than the baseline architectures on multiple datasets applied in the image classification task of the vision domain.
title	NiNformer: A Network in Network Transformer with Token Mixing Generated Gating Function
topic	Computer Vision and Pattern Recognition Machine Learning
url	https://arxiv.org/abs/2403.02411

Similar Items