Bibliographic Details
Main Authors: Kim, Jeongsoo; Nang, Jongho; Choe, Junsuk
Format: Preprint
Published: 2024
Subjects: Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access: https://arxiv.org/abs/2409.03516
_version_ 1866909306387955712
author Kim, Jeongsoo
Nang, Jongho
Choe, Junsuk
contents Recent Vision Transformer (ViT)-based methods for Image Super-Resolution have demonstrated impressive performance. However, they suffer from significant complexity, resulting in high inference times and memory usage. Additionally, ViT models using Window Self-Attention (WSA) face challenges in processing regions outside their windows. To address these issues, we propose the Low-to-high Multi-Level Transformer (LMLT), which employs attention with varying feature sizes for each head. LMLT divides image features along the channel dimension, gradually reduces spatial size for lower heads, and applies self-attention to each head. This approach effectively captures both local and global information. By integrating the results from lower heads into higher heads, LMLT overcomes the window boundary issues in self-attention. Extensive experiments show that our model significantly reduces inference time and GPU memory usage while maintaining or even surpassing the performance of state-of-the-art ViT-based Image Super-Resolution methods. Our code is available at https://github.com/jwgdmkj/LMLT.
format Preprint
id arxiv_https___arxiv_org_abs_2409_03516
institution arXiv
publishDate 2024
record_format arxiv
title LMLT: Low-to-high Multi-Level Vision Transformer for Image Super-Resolution
topic Computer Vision and Pattern Recognition
Artificial Intelligence
url https://arxiv.org/abs/2409.03516