Bibliographic Details
Main Authors: Vo, Hung Q., Yuan, Pengyu, Yin, Zheng, Wong, Kelvin K., Ezeana, Chika F., Ly, Son T., Nguyen, Hien V., Wong, Stephen T. C.
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2503.07157
author Vo, Hung Q.
Yuan, Pengyu
Yin, Zheng
Wong, Kelvin K.
Ezeana, Chika F.
Ly, Son T.
Nguyen, Hien V.
Wong, Stephen T. C.
contents Self-supervised learning (SSL) with Vision Transformers (ViT) has shown strong potential in medical image analysis. However, standard self-attention scales quadratically ($\mathcal{O}(N^2)$) with the number of tokens $N$, which poses a severe barrier for high-resolution biomedical tasks and effectively excludes resource-constrained research labs from using state-of-the-art models. To address this computational bottleneck without sacrificing diagnostic accuracy, we propose \textbf{MIRAM}, a Multi-scale Masked Autoencoder that leverages a \textbf{hybrid-attention mechanism}. Our architecture decouples semantic learning from detail reconstruction using a dual-decoder design: a standard transformer decoder captures global semantics at low resolution, while a linear-complexity decoder (we compare Linformer, Performer, and Nyströmformer) handles the computationally expensive high-resolution reconstruction. This reduces the complexity of the upscaling stage from quadratic to linear ($\mathcal{O}(N)$), enabling high-fidelity training on consumer-grade GPUs. We validate our approach on the CBIS-DDSM mammography dataset. Our \textbf{Nyströmformer-based variant} achieves a classification accuracy of \textbf{61.0\%}, outperforming both standard MAE (58.9\%) and MoCo-v3 (60.2\%) while requiring significantly less memory. These results suggest that hybrid-attention architectures can make high-resolution medical AI more widely accessible, bringing powerful SSL within reach of researchers with limited hardware resources.
format Preprint
id arxiv_https___arxiv_org_abs_2503_07157
institution arXiv
publishDate 2025
record_format arxiv
title Efficient Multi-scale Masked Autoencoders with Hybrid-Attention Mechanism for Breast Lesion Classification
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2503.07157
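The abstract in this record contrasts quadratic self-attention with linear-complexity alternatives such as Nyströmformer. The following is a minimal NumPy sketch of that general idea, not the authors' MIRAM implementation: the function names, the segment-mean landmark choice, and the use of an exact pseudo-inverse (np.linalg.pinv) instead of an iterative approximation are all assumptions made for illustration. It shows that the approximate version never builds the full N x N score matrix; with m landmarks its largest intermediates are N x m and m x N.

```python
import numpy as np


def softmax(x):
    """Row-wise softmax with the usual max-subtraction for stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def full_attention(Q, K, V):
    """Standard attention: materializes an N x N score matrix."""
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V


def nystrom_attention(Q, K, V, num_landmarks):
    """Nystrom-style approximation of softmax attention (illustrative).

    Landmarks are the means of contiguous token segments, so N must be
    divisible by num_landmarks. For a fixed number of landmarks m, the
    cost grows linearly in N.
    """
    N, d = Q.shape
    if N % num_landmarks != 0:
        raise ValueError("N must be divisible by num_landmarks")
    seg = N // num_landmarks
    Q_l = Q.reshape(num_landmarks, seg, d).mean(axis=1)  # (m, d)
    K_l = K.reshape(num_landmarks, seg, d).mean(axis=1)  # (m, d)
    scale = np.sqrt(d)
    F = softmax(Q @ K_l.T / scale)    # (N, m): tokens vs. key landmarks
    A = softmax(Q_l @ K_l.T / scale)  # (m, m): landmarks vs. landmarks
    B = softmax(Q_l @ K.T / scale)    # (m, N): query landmarks vs. tokens
    # Multiply B @ V first so that no N x N matrix is ever formed.
    return F @ np.linalg.pinv(A) @ (B @ V)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, d, m = 64, 16, 8  # hypothetical sizes chosen for the demo
    Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
    exact = full_attention(Q, K, V)
    approx = nystrom_attention(Q, K, V, num_landmarks=m)
    print("output shape:", approx.shape)
    print("mean abs difference vs. full attention:",
          np.abs(exact - approx).mean())
```

When num_landmarks equals N, every token is its own landmark and the sketch reproduces full attention up to floating-point error. Smaller values trade accuracy for memory.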