Staff View: :: Library Catalog

Bibliographic Details
Main Authors:	Sui, Yueyuan; Mohapatra, Payal; Eldenk, Doğaç; Yang, Haodong; Zhang, Yiting; Zhang, Haoyan; Zhu, Qi; Xia, Stephen
Format:	Preprint
Published:	2026
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2604.08971

_version_	1866908951235264512
author	Sui, Yueyuan; Mohapatra, Payal; Eldenk, Doğaç; Yang, Haodong; Zhang, Yiting; Zhang, Haoyan; Zhu, Qi; Xia, Stephen
author_facet	Sui, Yueyuan; Mohapatra, Payal; Eldenk, Doğaç; Yang, Haodong; Zhang, Yiting; Zhang, Haoyan; Zhu, Qi; Xia, Stephen
contents	Edge devices increasingly run multimodal sensing pipelines that must remain accurate despite fluctuating power budgets and unpredictable sensor dropout. Existing pruning methods fail under these conditions: they generally require fine-tuning after compression, consuming over $10\times$ the deployment energy, and they assign static importance scores that are blind to which sensors are present. We present the SentryFuse framework, which addresses both challenges jointly through two key components. First, SentryGate learns modality-conditioned importance scores during training via first-order saliency supervision and then prunes attention heads and feed-forward channels at deployment without fine-tuning. Second, SentryAttend replaces dense self-attention, a key bottleneck in contemporary multimodal architectures, with sparse grouped-query attention, yielding a net 15% reduction in GFLOPs across three different multimodal architectures. Across three applications and multimodal backbones, SentryGate achieves a 12.7% average accuracy improvement over the strongest pruning baseline, and up to 18% under modality dropout conditions. Together, SentryFuse reduces memory by 28.2% and lowers latency by up to $1.63\times$ without further fine-tuning, establishing modality-aware zero-shot compression as a practical path to multimodal intelligence on heterogeneous edge hardware.
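The contents field describes SentryGate only at a high level. As a rough illustration, here is a minimal Python sketch of modality-conditioned, fine-tuning-free pruning, assuming the common first-order Taylor criterion |a · ∂L/∂a| as the saliency measure. It is not the authors' implementation: instead of training a scorer under saliency supervision, it simply accumulates saliency separately for each set of present sensors into a lookup table. The helper names (`taylor_saliency`, `build_score_table`, `prune_mask`), the modality labels `imu` and `audio`, and the toy numbers are all hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List, Sequence, Set, Tuple

# Hypothetical types for this sketch: one row per calibration sample,
# one column per prunable unit (an attention head or a feed-forward channel).
Matrix = Sequence[Sequence[float]]
Batch = Tuple[Set[str], Matrix, Matrix]


def taylor_saliency(acts: Matrix, grads: Matrix) -> List[float]:
    """First-order Taylor saliency per unit: |sum_n a[n][u] * dL/da[n][u]|."""
    num_units = len(acts[0])
    return [abs(sum(a[u] * g[u] for a, g in zip(acts, grads))) for u in range(num_units)]


def build_score_table(batches: Iterable[Batch]) -> Dict[FrozenSet[str], List[float]]:
    """Accumulate saliency separately for each set of sensors that was present.

    Assumption: training simulates sensor dropout, so every sensor subset seen
    at deployment also appears here.
    """
    table: Dict[FrozenSet[str], List[float]] = {}
    for present, acts, grads in batches:
        key = frozenset(present)
        scores = taylor_saliency(acts, grads)
        if key in table:
            table[key] = [old + new for old, new in zip(table[key], scores)]
        else:
            table[key] = scores
    return table


def prune_mask(table: Dict[FrozenSet[str], List[float]], present: Set[str], keep: int) -> List[bool]:
    """Keep the `keep` highest-scoring units for the sensors present right now.

    No weights change: the mask is used as-is, with no fine-tuning step.
    Raises KeyError if this sensor subset never appeared during calibration.
    """
    scores = table[frozenset(present)]
    ranked = sorted(range(len(scores)), key=lambda u: scores[u], reverse=True)
    kept = set(ranked[:keep])
    return [u in kept for u in range(len(scores))]


# Toy calibration data (made-up numbers): 4 attention heads, one sample per subset.
calibration: List[Batch] = [
    ({"imu", "audio"}, [[1.0, 1.0, 1.0, 1.0]], [[0.9, 0.1, 0.5, 0.2]]),
    ({"imu"}, [[1.0, 1.0, 1.0, 1.0]], [[0.05, 0.8, -0.6, 0.1]]),
]
table = build_score_table(calibration)
full_mask = prune_mask(table, {"audio", "imu"}, keep=2)
imu_mask = prune_mask(table, {"imu"}, keep=2)
print(full_mask)  # [True, False, True, False]
print(imu_mask)   # [False, True, True, False]
```

With both sensors present the toy table keeps heads 0 and 2; with audio dropped it keeps heads 1 and 2, so the kept set tracks the available modalities while the weights stay untouched. A real deployment would skip the pruned heads' computation entirely rather than merely flag them.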
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_08971
institution	arXiv
publishDate	2026
record_format	arxiv
title	Modality-Aware Zero-Shot Pruning and Sparse Attention for Efficient Multimodal Edge Inference
topic	Machine Learning
url	https://arxiv.org/abs/2604.08971

Similar Items