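Staff View: Efficient Multi-scale Masked Autoencoders with Hybrid-Attention Mechanism for Breast Lesion Classification :: Library Catalog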

Bibliographic Details
Main Authors:	Vo, Hung Q.; Yuan, Pengyu; Yin, Zheng; Wong, Kelvin K.; Ezeana, Chika F.; Ly, Son T.; Nguyen, Hien V.; Wong, Stephen T. C.
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2503.07157

_version_	1866915748364943360
author	Vo, Hung Q.; Yuan, Pengyu; Yin, Zheng; Wong, Kelvin K.; Ezeana, Chika F.; Ly, Son T.; Nguyen, Hien V.; Wong, Stephen T. C.
author_facet	Vo, Hung Q.; Yuan, Pengyu; Yin, Zheng; Wong, Kelvin K.; Ezeana, Chika F.; Ly, Son T.; Nguyen, Hien V.; Wong, Stephen T. C.
contents	Self-supervised learning (SSL) with Vision Transformers (ViT) has shown immense potential in medical image analysis. However, the quadratic complexity ($\mathcal{O}(N^2)$) of standard self-attention poses a severe barrier for high-resolution biomedical tasks, effectively excluding resource-constrained research labs from utilizing state-of-the-art models. To address this computational bottleneck without sacrificing diagnostic accuracy, we propose MIRAM, a Multi-scale Masked Autoencoder that leverages a hybrid-attention mechanism. Our architecture uniquely decouples semantic learning from detail reconstruction using a dual-decoder design: a standard transformer decoder captures global semantics at low resolution, while a linear-complexity decoder (comparing Linformer, Performer, and Nyströmformer) handles the computationally expensive high-resolution reconstruction. This reduces the complexity of the upscaling stage from quadratic to linear ($\mathcal{O}(N)$), enabling high-fidelity training on consumer-grade GPUs. We validate our approach on the CBIS-DDSM mammography dataset. Remarkably, our Nyströmformer-based variant achieves a classification accuracy of 61.0%, outperforming both standard MAE (58.9%) and MoCo-v3 (60.2%) while requiring significantly less memory. These results demonstrate that hybrid-attention architectures can democratize high-resolution medical AI, making powerful SSL accessible to researchers with limited hardware resources.
format	Preprint
id	arxiv_https___arxiv_org_abs_2503_07157
institution	arXiv
publishDate	2025
record_format	arxiv
title	Efficient Multi-scale Masked Autoencoders with Hybrid-Attention Mechanism for Breast Lesion Classification
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2503.07157
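
The abstract's contrast between quadratic and linear attention can be illustrated with a minimal sketch. It is not the MIRAM decoder and is not taken from the paper: it assumes a Linformer-style scheme in which keys and values are projected along the token axis from N rows down to a fixed rank r, and it uses a hand-written (not learned) projection shared by keys and values. All function names here (`attention`, `projected_attention`, `score_entries`) are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(q, k, v):
    """Standard scaled dot-product attention; the score matrix is len(q) x len(k)."""
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = [[s * scale for s in row] for row in matmul(q, transpose(k))]
    return matmul(softmax_rows(scores), v)

def projected_attention(q, k, v, proj):
    """Linformer-style attention (assumption: one projection shared by K and V).

    proj has shape (r, N), so keys and values shrink from N rows to r rows
    and the score matrix becomes N x r instead of N x N.
    """
    return attention(q, matmul(proj, k), matmul(proj, v))

def score_entries(n_tokens, rank=None):
    """Number of attention-score entries for full (rank=None) or projected attention."""
    return n_tokens * (n_tokens if rank is None else rank)

if __name__ == "__main__":
    for n in (256, 1024, 4096):
        print(n, score_entries(n), score_entries(n, rank=64))
```

With the rank held fixed, the score count grows in proportion to the number of tokens rather than its square, which is the sense in which such a decoder is linear in N. Performer and Nyströmformer reach linear cost through different approximations that this sketch does not cover.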

Similar Items