Staff View: FireQ: Fast INT4-FP8 Kernel and RoPE-aware Quantization for LLM Inference Acceleration :: Library Catalog

Bibliographic Details
Main Authors:	Baek, Daehyeon; Choi, Jieun; Son, Jimyoung; Bin, Kyungmin; Choi, Seungbeom; Moon, Kihyo; Jang, Minsung; Lee, Hyojung
Format:	Preprint
Published:	2025
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2505.20839

_version_	1866911063042162688
author	Baek, Daehyeon; Choi, Jieun; Son, Jimyoung; Bin, Kyungmin; Choi, Seungbeom; Moon, Kihyo; Jang, Minsung; Lee, Hyojung
author_facet	Baek, Daehyeon; Choi, Jieun; Son, Jimyoung; Bin, Kyungmin; Choi, Seungbeom; Moon, Kihyo; Jang, Minsung; Lee, Hyojung
contents	As large language models become increasingly prevalent, memory bandwidth constraints significantly limit inference throughput, motivating post-training quantization (PTQ). In this paper, we propose FireQ, a co-designed PTQ framework and INT4-FP8 matrix multiplication kernel that accelerates LLM inference across all linear layers. Specifically, FireQ quantizes linear layer weights and key-values to INT4, and activations and queries to FP8, significantly enhancing throughput. Additionally, we introduce three-stage pipelining for the prefill phase, which modifies the FlashAttention-3 kernel and effectively reduces time-to-first-token. To minimize accuracy loss from quantization, we develop novel outlier smoothing techniques tailored separately for linear and attention layers. In linear layers, we explicitly use per-tensor scaling to prevent underflow caused by the FP8 quantization scaling factor of INT4 quantization, and channel-wise scaling to compensate for the coarse granularity of INT4. In attention layers, we address quantization challenges posed by rotary positional embeddings (RoPE) by combining pre-RoPE and post-RoPE scaling strategies. FireQ significantly outperforms state-of-the-art methods, achieving 1.68x faster inference in feed-forward network layers on Llama2-7B and 1.26x faster prefill phase performance on Llama3-8B compared to QServe, with negligible accuracy loss.
format	Preprint
id	arxiv_https___arxiv_org_abs_2505_20839
institution	arXiv
publishDate	2025
record_format	arxiv
title	FireQ: Fast INT4-FP8 Kernel and RoPE-aware Quantization for LLM Inference Acceleration
topic	Machine Learning
url	https://arxiv.org/abs/2505.20839
