Staff View :: Library Catalog

Bibliographic Details
Main Authors:	Zhu, Guangyang; Zhang, Jianfeng; Feng, Yuanzhi; Lan, Hai
Format:	Preprint
Published:	2022
Subjects:	Computer Vision and Pattern Recognition; Machine Learning
Online Access:	https://arxiv.org/abs/2201.01410

_version_	1866917788786884608
author	Zhu, Guangyang; Zhang, Jianfeng; Feng, Yuanzhi; Lan, Hai
author_facet	Zhu, Guangyang; Zhang, Jianfeng; Feng, Yuanzhi; Lan, Hai
contents	The self-attention module shows outstanding competence in capturing long-range relationships while enhancing performance on vision tasks such as image classification and image captioning. However, the self-attention module relies heavily on dot-product multiplication and dimension alignment among query-key-value features, which cause two problems: (1) The dot-product multiplication results in exhaustive and redundant computation. (2) Because the visual feature map often appears as a multi-dimensional tensor, reshaping the tensor feature to satisfy the dimension alignment might destroy the internal structure of the tensor feature map. To address these problems, this paper proposes a self-attention plug-in module with its variants, namely Synthesizing Tensor Transformations (STT), for directly processing image tensor features. Without computing the dot-product multiplication among query-key-value, the basic STT is composed of tensor transformations that learn the synthetic attention weight from visual information. The effectiveness of the STT series is validated on image classification and image captioning. Experiments show that the proposed STT achieves competitive performance while keeping robustness compared to self-attention in the aforementioned vision tasks.
format	Preprint
id	arxiv_https___arxiv_org_abs_2201_01410
institution	arXiv
publishDate	2022
record_format	arxiv
title	Synthesizer Based Efficient Self-Attention for Vision Tasks
topic	Computer Vision and Pattern Recognition; Machine Learning
url	https://arxiv.org/abs/2201.01410
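
Illustrative note on the abstract: the record says the attention weights are learned from the visual features without a query-key dot product. The sketch below shows that general idea in the style of a dense synthesizer, where a small two-layer network maps each token directly to a row of attention logits. It is not the authors' STT module: it works on a flattened (n, d) token matrix, which is exactly the reshaping the paper aims to avoid, and every name, shape, and parameter here is an assumption chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def dense_synthesizer_attention(x, w1, b1, w2, b2, wv):
    """Synthesize attention weights per token, with no query-key dot product.

    x  : (n, d) token features (hypothetical flattened feature map)
    w1 : (d, h), b1 : (h,)   first layer of the synthesizing network
    w2 : (h, n), b2 : (n,)   second layer, one logit per target position
    wv : (d, d)              value projection
    """
    hidden = np.maximum(x @ w1 + b1, 0.0)   # (n, h), ReLU
    logits = hidden @ w2 + b2               # (n, n), row i depends only on token i
    weights = softmax(logits, axis=-1)      # each row sums to 1
    values = x @ wv                         # (n, d)
    return weights @ values, weights


# Toy usage with random parameters (assumed sizes, no trained weights).
rng = np.random.default_rng(0)
n, d, h = 6, 8, 16
x = rng.normal(size=(n, d))
w1, b1 = rng.normal(size=(d, h)), np.zeros(h)
w2, b2 = rng.normal(size=(h, n)), np.zeros(n)
wv = rng.normal(size=(d, d))

out, weights = dense_synthesizer_attention(x, w1, b1, w2, b2, wv)
print(out.shape, weights.shape)
```

Because the second layer has a fixed output width n, this variant ties the module to a fixed number of positions, which is one design trade-off of synthesizing the weights instead of computing them from pairwise dot products.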
