Saved in:
| Main Authors: | , , , |
|---|---|
| Format: | Preprint |
| Published: |
2023
|
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2308.02161 |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
| _version_ | 1866913532602220544 |
|---|---|
| author | Moon, Jiyong Lee, Junseok Lee, Yunju Park, Seongsik |
| author_facet | Moon, Jiyong Lee, Junseok Lee, Yunju Park, Seongsik |
| contents | Recently, vision Transformers (ViTs) have been actively applied to fine-grained visual recognition (FGVR). ViT can effectively model the interdependencies between patch-divided object regions through an inherent self-attention mechanism. In addition, patch selection is used with ViT to remove redundant patch information and highlight the most discriminative object patches. However, existing ViT-based FGVR models are limited to single-scale processing, and their fixed receptive fields hinder representational richness and exacerbate vulnerability to scale variability. Therefore, we propose multi-scale patch selection (MSPS) to improve the multi-scale capabilities of existing ViT-based models. Specifically, MSPS selects salient patches of different scales at different stages of a multi-scale vision Transformer (MS-ViT). In addition, we introduce class token transfer (CTT) and multi-scale cross-attention (MSCA) to model cross-scale interactions between selected multi-scale patches and fully reflect them in model decisions. Compared to previous single-scale patch selection (SSPS), our proposed MSPS encourages richer object representations based on feature hierarchy and consistently improves performance from small-sized to large-sized objects. As a result, we propose M2Former, which outperforms CNN-/ViT-based models on several widely used FGVR benchmarks. |
| format | Preprint |
| id |
arxiv_https___arxiv_org_abs_2308_02161 |
| institution | arXiv |
| publishDate | 2023 |
| record_format | arxiv |
| spellingShingle | M2Former: Multi-Scale Patch Selection for Fine-Grained Visual Recognition Moon, Jiyong Lee, Junseok Lee, Yunju Park, Seongsik Computer Vision and Pattern Recognition Recently, vision Transformers (ViTs) have been actively applied to fine-grained visual recognition (FGVR). ViT can effectively model the interdependencies between patch-divided object regions through an inherent self-attention mechanism. In addition, patch selection is used with ViT to remove redundant patch information and highlight the most discriminative object patches. However, existing ViT-based FGVR models are limited to single-scale processing, and their fixed receptive fields hinder representational richness and exacerbate vulnerability to scale variability. Therefore, we propose multi-scale patch selection (MSPS) to improve the multi-scale capabilities of existing ViT-based models. Specifically, MSPS selects salient patches of different scales at different stages of a multi-scale vision Transformer (MS-ViT). In addition, we introduce class token transfer (CTT) and multi-scale cross-attention (MSCA) to model cross-scale interactions between selected multi-scale patches and fully reflect them in model decisions. Compared to previous single-scale patch selection (SSPS), our proposed MSPS encourages richer object representations based on feature hierarchy and consistently improves performance from small-sized to large-sized objects. As a result, we propose M2Former, which outperforms CNN-/ViT-based models on several widely used FGVR benchmarks. |
| title | M2Former: Multi-Scale Patch Selection for Fine-Grained Visual Recognition |
| topic | Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2308.02161 |