Staff View: WiSparse: Boosting LLM Inference Efficiency with Weight-Aware Mixed Activation Sparsity :: Library Catalog

Bibliographic Details
Main Authors:	Chen, Lei; Meng, Yuan; Zhan, Xiaoyu; Wang, Zhi; Zhu, Wenwu
Format:	Preprint
Published:	2026
Subjects:	Machine Learning; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2602.14452

_version_	1866910023458750464
author	Chen, Lei; Meng, Yuan; Zhan, Xiaoyu; Wang, Zhi; Zhu, Wenwu
author_facet	Chen, Lei; Meng, Yuan; Zhan, Xiaoyu; Wang, Zhi; Zhu, Wenwu
contents	Large Language Models (LLMs) offer strong capabilities but incur high inference costs due to dense computation and memory access. Training-free activation sparsity is a promising approach for efficient LLM inference, yet existing methods often rely solely on activation information and uniform sparsity ratios. This overlooks the critical interplay with weights and inter-block sensitivity variation, leading to suboptimal performance. We identify two key phenomena in modern LLMs: 1) less significant activations may align with highly important weights, and 2) sparsity sensitivity varies non-monotonically across model blocks. We propose Weight-aware Mixed-Granularity Training-free Activation Sparsity (WiSparse), which leverages both activation and weight information for adaptive sparsity allocation. Specifically, we introduce a weight-aware mechanism integrating activation magnitudes with precomputed weight norms to accurately identify salient channels. This is combined with a mixed-granularity allocation scheme: a global budget is distributed across blocks via evolutionary search to protect sensitive regions, then refined within blocks to minimize reconstruction error. We improve sparse kernels and demonstrate effectiveness on three representative models. Notably, at 50% sparsity, WiSparse preserves 97% of Llama3.1's dense performance, surpassing the strongest baseline by 2.23 percentage points while achieving a 21.4% acceleration in end-to-end inference speed. Our research advances the limits of training-free approaches for efficient LLM inference, pushing the boundaries of achievable speedup without training.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_14452
institution	arXiv
publishDate	2026
record_format	arxiv
title	WiSparse: Boosting LLM Inference Efficiency with Weight-Aware Mixed Activation Sparsity
topic	Machine Learning; Artificial Intelligence
url	https://arxiv.org/abs/2602.14452
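
Illustrative note on the contents field: the abstract describes a weight-aware mechanism that combines activation magnitudes with precomputed weight norms to pick salient channels, but the record does not give the scoring formula. The sketch below is an assumption for illustration only, not the authors' method: it scores each input channel as |activation| times the L2 norm of the matching weight column, and compares that with plain magnitude-based selection on a toy layer. All function names and the toy values are hypothetical.

```python
import math

def column_norms(weight):
    """L2 norm of each input channel (column); can be precomputed once per layer."""
    n_in = len(weight[0])
    return [math.sqrt(sum(row[j] ** 2 for row in weight)) for j in range(n_in)]

def channel_mask(activation, norms, sparsity):
    """Keep the channels with the highest |activation| * norm score (assumed rule)."""
    n_keep = len(activation) - int(len(activation) * sparsity)
    scores = [abs(a) * n for a, n in zip(activation, norms)]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    keep = set(ranked[:n_keep])
    return [j in keep for j in range(len(activation))]

def masked_matvec(weight, activation, mask):
    """Matrix-vector product that skips the channels dropped by the mask."""
    return [sum(w * a for w, a, m in zip(row, activation, mask) if m)
            for row in weight]

def l2_error(u, v):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(u, v)))

# Toy layer: 2 outputs, 4 input channels. Channel 1 has a small activation
# but a large weight column, the situation the abstract highlights.
W = [[0.1, 4.0, 0.2, 0.1],
     [0.1, 3.0, 0.1, 0.2]]
x = [1.0, 0.3, 0.9, -0.8]

dense = masked_matvec(W, x, [True] * 4)
magnitude_mask = channel_mask(x, [1.0] * 4, sparsity=0.5)           # activation only
weight_aware_mask = channel_mask(x, column_norms(W), sparsity=0.5)  # activation * norm

print(magnitude_mask)     # [True, False, True, False]
print(weight_aware_mask)  # [False, True, True, False]
print(l2_error(dense, masked_matvec(W, x, magnitude_mask)))
print(l2_error(dense, masked_matvec(W, x, weight_aware_mask)))
```

At 50% sparsity on this toy input, magnitude-only selection drops channel 1 and loses most of the output, while the weight-aware score keeps it and stays close to the dense result. The block-level budget allocation by evolutionary search mentioned in the abstract is not sketched here.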

Similar Items