Summary: :: Library Catalog

Bibliographic Details
Main Authors:	Skliar, Andrii; van Rozendaal, Ties; Lepert, Romain; Boinovski, Todor; van Baalen, Mart; Nagel, Markus; Whatmough, Paul; Bejnordi, Babak Ehteshami
Format:	Preprint
Published:	2024
Subjects:	Machine Learning; Artificial Intelligence; Hardware Architecture
Online Access:	https://arxiv.org/abs/2412.00099

Summary:

Mixture of Experts (MoE) LLMs have recently gained attention for their ability to enhance performance by selectively engaging specialized subnetworks or "experts" for each input. However, deploying MoEs on memory-constrained devices remains challenging, particularly when generating tokens sequentially with a batch size of one, as opposed to typical high-throughput settings involving long sequences or large batches. In this work, we optimize MoE on memory-constrained devices where only a subset of expert weights fit in DRAM. We introduce a novel cache-aware routing strategy that leverages expert reuse during token generation to improve cache locality. We evaluate our approach on language modeling, MMLU, and GSM8K benchmarks and present on-device results demonstrating 2$\times$ speedups on mobile devices, offering a flexible, training-free solution to extend MoE's applicability across real-world applications.
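
The abstract does not spell out the routing rule, so the following is only an illustrative sketch of what cache-aware routing could look like, not the authors' method. It assumes a router that emits one logit per expert, an LRU cache of resident experts, and a hypothetical `bonus` added to the logits of cached experts before top-k selection. The names `LRUExpertCache`, `cache_aware_top_k`, and `bonus` are all assumptions introduced for this example.

```python
from collections import OrderedDict


class LRUExpertCache:
    """Tracks which expert ids are resident in fast memory (hypothetical LRU policy)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._resident = OrderedDict()  # expert_id -> True, oldest first
        self.hits = 0
        self.misses = 0

    def __contains__(self, expert_id):
        return expert_id in self._resident

    def touch(self, expert_id):
        """Record a use of expert_id, loading it (and evicting the LRU one) on a miss."""
        if expert_id in self._resident:
            self._resident.move_to_end(expert_id)
            self.hits += 1
            return
        self.misses += 1
        if len(self._resident) >= self.capacity:
            self._resident.popitem(last=False)  # evict least recently used
        self._resident[expert_id] = True


def cache_aware_top_k(logits, cache, k, bonus):
    """Pick k experts after adding `bonus` to the logits of cached experts.

    Assumption: a simple additive bias; bonus=0.0 reduces to plain top-k routing.
    """
    adjusted = [
        logit + (bonus if expert_id in cache else 0.0)
        for expert_id, logit in enumerate(logits)
    ]
    ranked = sorted(range(len(logits)), key=lambda i: adjusted[i], reverse=True)
    chosen = ranked[:k]
    for expert_id in chosen:
        cache.touch(expert_id)
    return chosen
```

In this sketch a larger `bonus` trades routing fidelity for fewer expert loads: the router is nudged to reuse experts already in memory from earlier tokens, which is the kind of cache locality the abstract refers to. How the paper balances that trade-off is not stated in this record.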
