Staff View: IntAttention: A Fully Integer Attention Pipeline for Efficient Edge Inference :: Library Catalog

Bibliographic Details
Main Authors:	Zhong, Wanli; Feng, Haibo; Zhou, Zirui; Peng, Hanyang; Yu, Shiqi
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2511.21513

_version_	1866918530730950656
author	Zhong, Wanli; Feng, Haibo; Zhou, Zirui; Peng, Hanyang; Yu, Shiqi
author_facet	Zhong, Wanli; Feng, Haibo; Zhou, Zirui; Peng, Hanyang; Yu, Shiqi
contents	Deploying Transformer models on edge devices is limited by latency and energy budgets. While INT8 quantization effectively accelerates the primary matrix multiplications, it exposes the softmax-related path as the dominant bottleneck. This stage incurs a costly dequantize -> softmax -> requantize detour, which can account for up to 65% of total attention latency and disrupts the end-to-end integer dataflow critical for edge hardware efficiency. To address this limitation, we present IntAttention, the first fully integer attention pipeline that serves as a training-free drop-in replacement. At the core of our approach lies IndexSoftmax, a hardware-friendly operator that replaces floating-point exponentials entirely within the integer domain. IntAttention integrates sparsity-aware clipping, a 32-entry lookup table approximation, and direct integer normalization, thereby eliminating datatype conversion overhead along the attention path. Experiments on Armv8 CPUs show that our method achieves up to 3.7x speedup and 61% energy reduction over FP16 baselines, and up to 2.0x speedup over conventional INT8 attention pipelines. Across diverse language and vision models, as well as additional reasoning and long-context evaluations, IntAttention maintains strong overall fidelity and demonstrates a more favorable trade-off than existing LUT-based softmax approximations. Code is available at https://github.com/WanliZhong/IntAttention
format	Preprint
id	arxiv_https___arxiv_org_abs_2511_21513
institution	arXiv
publishDate	2025
record_format	arxiv
title	IntAttention: A Fully Integer Attention Pipeline for Efficient Edge Inference
topic	Machine Learning
url	https://arxiv.org/abs/2511.21513
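
Illustrative note: the contents field above describes a softmax that clips its inputs, looks up exponentials in a 32-entry table, and normalizes directly in integers. The Python sketch below shows one generic way such a lookup-table integer softmax could be arranged. It is not the authors' IndexSoftmax: the clipping range CLIP, the 8-bit output range OUT_MAX, the rounding rules, and the names build_lut and int_softmax are all assumptions made for illustration. Only the table size of 32 comes from the record.

```python
import math

LUT_SIZE = 32   # table size taken from the abstract
CLIP = 8.0      # hypothetical clipping range, in real-valued logit units
OUT_MAX = 255   # hypothetical unsigned 8-bit output range


def build_lut(clip=CLIP, size=LUT_SIZE, out_max=OUT_MAX):
    """Offline step: tabulate exp(-x) on [0, clip] as unsigned integers."""
    step = clip / (size - 1)
    return [round(out_max * math.exp(-i * step)) for i in range(size)]


def int_softmax(logits_q, clip_q, lut):
    """Runtime step: integer additions, multiplications, and divisions only.

    logits_q: integer logits (for example INT32 accumulator values)
    clip_q:   the clipping range expressed in the same integer units, > 0
    lut:      table produced by build_lut
    """
    top = max(logits_q)
    last = len(lut) - 1
    weights = []
    for q in logits_q:
        d = min(top - q, clip_q)                  # distance from the max, clipped
        idx = (d * last + clip_q // 2) // clip_q  # rounded table index in [0, last]
        weights.append(lut[idx])
    total = sum(weights)  # at least lut[0] > 0, because the max maps to index 0
    # Rounded integer division rescales the weights to [0, OUT_MAX].
    return [(w * OUT_MAX + total // 2) // total for w in weights]


if __name__ == "__main__":
    scale = 0.1                    # hypothetical quantization scale of the logits
    clip_q = round(CLIP / scale)   # computed offline, so runtime stays integer
    lut = build_lut()
    print(int_softmax([40, 40, 0], clip_q, lut))
```

In this sketch floating point appears only when the table and clip_q are prepared offline; the per-token path never leaves the integer domain, which is the property the abstract emphasizes. Because of rounding, the outputs need not sum exactly to OUT_MAX.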

Similar Items