Bibliographic Details
Main Authors: Zhong, Wanli; Feng, Haibo; Zhou, Zirui; Peng, Hanyang; Yu, Shiqi
Format: Preprint
Published: 2025
Subjects: Machine Learning
Online Access: https://arxiv.org/abs/2511.21513
_version_ 1866918530730950656
author Zhong, Wanli
Feng, Haibo
Zhou, Zirui
Peng, Hanyang
Yu, Shiqi
contents Deploying Transformer models on edge devices is limited by latency and energy budgets. While INT8 quantization effectively accelerates the primary matrix multiplications, it exposes the softmax-related path as the dominant bottleneck. This stage incurs a costly dequantize -> softmax -> requantize detour, which can account for up to 65% of total attention latency and disrupts the end-to-end integer dataflow critical for edge hardware efficiency. To address this limitation, we present IntAttention, the first fully integer attention pipeline that serves as a training-free drop-in replacement. At the core of our approach lies IndexSoftmax, a hardware-friendly operator that replaces floating-point exponentials entirely within the integer domain. IntAttention integrates sparsity-aware clipping, a 32-entry lookup table approximation, and direct integer normalization, thereby eliminating datatype conversion overhead along the attention path. Experiments on Armv8 CPUs show that our method achieves up to 3.7x speedup and 61% energy reduction over FP16 baselines, and up to 2.0x speedup over conventional INT8 attention pipelines. Across diverse language and vision models, as well as additional reasoning and long-context evaluations, IntAttention maintains strong overall fidelity and demonstrates a more favorable trade-off than existing LUT-based softmax approximations. Code is available at https://github.com/WanliZhong/IntAttention
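The abstract names three ingredients for the integer softmax path: clipping, a 32-entry lookup table, and integer normalization. The sketch below shows one way such a pipeline could fit together. It is not the authors' IndexSoftmax; the names `build_lut` and `int_softmax_row`, the clip threshold of 8.0, the 8-bit output range, and the rounding scheme are all assumptions made for illustration. The linked repository holds the actual implementation.

```python
import math

LUT_SIZE = 32   # table size named in the abstract
OUT_MAX = 255   # assumed unsigned 8-bit output range (not stated in the abstract)


def build_lut(clip_real: float) -> list[int]:
    """Offline step: tabulate exp(-t) for t in [0, clip_real] as 8-bit integers."""
    step = clip_real / (LUT_SIZE - 1)
    return [round(OUT_MAX * math.exp(-i * step)) for i in range(LUT_SIZE)]


def int_softmax_row(logits: list[int], clip_int: int, lut: list[int]) -> list[int]:
    """Softmax over one row of integer logits using only integer arithmetic.

    clip_int is the clip threshold in integer logit units (clip_real divided
    by the logit scale, rounded offline); it must be positive.
    """
    row_max = max(logits)
    weights = []
    for x in logits:
        # Distance below the row maximum, clipped: entries far below the
        # maximum all share the last table slot.
        dist = min(row_max - x, clip_int)
        # Rounded mapping of [0, clip_int] onto table indices 0..31.
        idx = (dist * (LUT_SIZE - 1) + clip_int // 2) // clip_int
        weights.append(lut[idx])
    # The maximum entry always maps to lut[0] == OUT_MAX, so total > 0.
    total = sum(weights)
    # Rounded integer division replaces the floating-point normalization.
    return [(w * OUT_MAX + total // 2) // total for w in weights]


lut = build_lut(8.0)                          # hypothetical clip threshold
probs = int_softmax_row([100, 100, 0], 50, lut)
print(probs)                                  # [128, 128, 0]
```

Only `build_lut` uses floating point, and it runs once ahead of time; the per-row work is subtraction, comparison, one table lookup, and integer division, which is the property the abstract attributes to a fully integer attention path.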
format Preprint
id arxiv_https___arxiv_org_abs_2511_21513
institution arXiv
publishDate 2025
record_format arxiv
title IntAttention: A Fully Integer Attention Pipeline for Efficient Edge Inference
topic Machine Learning
url https://arxiv.org/abs/2511.21513