| Main Authors: | Liu, Xiao; Zhang, Lijun; Ganesan, Deepak; Guan, Hui |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Machine Learning; Artificial Intelligence |
| Online Access: | https://arxiv.org/abs/2505.19342 |
| _version_ | 1866914605866942464 |
|---|---|
| author | Liu, Xiao; Zhang, Lijun; Ganesan, Deepak; Guan, Hui |
| author_facet | Liu, Xiao; Zhang, Lijun; Ganesan, Deepak; Guan, Hui |
| contents | Multi-device inference can reduce Transformer latency by parallelizing computation. However, existing methods require high inter-device bandwidth, making them impractical for bandwidth-constrained environments. We present ASTRA, a communication-efficient framework that integrates sequence parallelism with mixed-precision attention, where non-local token embeddings are transmitted as low-bit vector-quantized codes while local attention remains full precision. To preserve accuracy under aggressive compression, ASTRA introduces Noise-Augmented Quantization and Distributed Class Tokens. Across vision and language models (e.g., ViT and GPT2), ASTRA achieves up to 2.64$\times$ speedup over single-device inference and up to 15.25$\times$ over prior multi-device baselines while operating at bandwidths as low as 10 Mbps. ASTRA remains robust on large models (e.g., Llama-3-8B) even under non-ideal network conditions such as packet loss and dynamic networks. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2505_19342 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | ASTRA: Communication-Efficient Acceleration for Multi-Device Transformer Inference |
| topic | Machine Learning; Artificial Intelligence |
| url | https://arxiv.org/abs/2505.19342 |