Staff View: ASTRA: Communication-Efficient Acceleration for Multi-Device Transformer Inference :: Library Catalog

Bibliographic Details
Main Authors:	Liu, Xiao; Zhang, Lijun; Ganesan, Deepak; Guan, Hui
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2505.19342

_version_	1866914605866942464
author	Liu, Xiao; Zhang, Lijun; Ganesan, Deepak; Guan, Hui
author_facet	Liu, Xiao; Zhang, Lijun; Ganesan, Deepak; Guan, Hui
contents	Multi-device inference can reduce Transformer latency by parallelizing computation. However, existing methods require high inter-device bandwidth, making them impractical for bandwidth-constrained environments. We present ASTRA, a communication-efficient framework that integrates sequence parallelism with mixed-precision attention, where non-local token embeddings are transmitted as low-bit vector-quantized codes while local attention remains full precision. To preserve accuracy under aggressive compression, ASTRA introduces Noise-Augmented Quantization and Distributed Class Tokens. Across vision and language models (e.g., ViT and GPT2), ASTRA achieves up to 2.64× speedup over single-device inference and up to 15.25× over prior multi-device baselines while operating at bandwidths as low as 10 Mbps. ASTRA remains robust on large models (e.g., Llama-3-8B) even under non-ideal network conditions such as packet loss and dynamic networks.
format	Preprint
id	arxiv_https___arxiv_org_abs_2505_19342
institution	arXiv
publishDate	2025
record_format	arxiv
title	ASTRA: Communication-Efficient Acceleration for Multi-Device Transformer Inference
topic	Machine Learning; Artificial Intelligence
url	https://arxiv.org/abs/2505.19342

Similar Items