
Bibliographic Details
Main Authors:	Hu, Junyi, Bai, Tian, Wu, Fengyi, Peng, Zhenming, Zhang, Yi
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2505.12772

Abstract:

Feature fusion plays a pivotal role in achieving high performance in vision models, yet existing attention-based fusion techniques often suffer from substantial computational overhead and implementation complexity, particularly in resource-constrained settings. To address these limitations, we introduce the Plug-and-Play Hierarchical C2F Transformer (P²HCT), a lightweight module that combines coarse-to-fine token selection with shared attention parameters to preserve spatial details while reducing inference cost. P²HCT is trainable using coarse attention alone and can be seamlessly activated at inference to enhance accuracy without retraining. Integrated into real-time detectors such as YOLOv11-N/S/M, P²HCT achieves mAP gains of 0.9%, 0.5%, and 0.4% on MS COCO with minimal latency increase. Similarly, embedding P²HCT into ResNet-18/50/101 backbones improves ImageNet top-1 accuracy by 6.5%, 1.7%, and 1.0%, respectively. These results underscore P²HCT's effectiveness as a hardware-friendly and general-purpose enhancement for both detection and classification tasks.
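The abstract names two ideas, coarse-to-fine token selection and attention parameters shared between the coarse and fine stages, without giving details. The sketch below is a generic illustration of that pattern and not the authors' implementation: the function names, the 1-D token layout, mean pooling, the top-k selection rule, and the `use_fine` switch are all assumptions made for this example. It uses only the Python standard library.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query; returns (output, weights)."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([scale * sum(q * k for q, k in zip(query, key)) for key in keys])
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, weights


def mean_pool(tokens, window):
    """Average consecutive groups of `window` tokens into coarse tokens."""
    pooled = []
    for start in range(0, len(tokens), window):
        group = tokens[start:start + window]
        pooled.append([sum(col) / len(group) for col in zip(*group)])
    return pooled


def coarse_to_fine_attention(query_tok, tokens, Wq, Wk, Wv,
                             window=4, top_k=2, use_fine=True):
    """Hypothetical coarse-to-fine attention with shared Wq/Wk/Wv."""
    q = matvec(Wq, query_tok)

    # Coarse stage: attend over pooled tokens.
    coarse = mean_pool(tokens, window)
    coarse_out, coarse_w = attend(q, [matvec(Wk, t) for t in coarse],
                                  [matvec(Wv, t) for t in coarse])
    if not use_fine:
        return coarse_out, []

    # Selection: keep the top_k coarse windows with the largest attention weight.
    top = sorted(range(len(coarse)), key=lambda i: coarse_w[i], reverse=True)[:top_k]
    selected = sorted(i for w in top
                      for i in range(w * window, min((w + 1) * window, len(tokens))))

    # Fine stage: attend over the original tokens in the selected windows,
    # reusing the same projection matrices (the "shared parameters" assumption).
    fine_out, _ = attend(q, [matvec(Wk, tokens[i]) for i in selected],
                         [matvec(Wv, tokens[i]) for i in selected])
    return fine_out, selected


random.seed(0)
dim, n_tokens = 4, 16
rand_matrix = lambda: [[random.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
Wq, Wk, Wv = rand_matrix(), rand_matrix(), rand_matrix()
tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]

coarse_out, coarse_selected = coarse_to_fine_attention(
    tokens[0], tokens, Wq, Wk, Wv, use_fine=False)
fine_out, selected = coarse_to_fine_attention(
    tokens[0], tokens, Wq, Wk, Wv, use_fine=True)
```

Because both stages reuse the same projection matrices, switching `use_fine` on adds no parameters. This is one possible reading of the abstract's statement that the module can be trained with coarse attention alone and enabled at inference. With `window=4` and `top_k=2`, the fine stage attends over 8 of the 16 tokens rather than all of them.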

Similar Items