Staff View: From Dense to Dynamic: Token-Difficulty Driven MoEfication of Pre-Trained LLMs :: Library Catalog

Bibliographic Details
Main Authors:	Nishu, Kumari; Mehta, Sachin; Abnar, Samira; Farajtabar, Mehrdad; Horton, Maxwell; Najibi, Mahyar; Nabi, Moin; Cho, Minsik; Naik, Devang
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2502.12325

_version_	1866916619026956288
author	Nishu, Kumari; Mehta, Sachin; Abnar, Samira; Farajtabar, Mehrdad; Horton, Maxwell; Najibi, Mahyar; Nabi, Moin; Cho, Minsik; Naik, Devang
author_facet	Nishu, Kumari; Mehta, Sachin; Abnar, Samira; Farajtabar, Mehrdad; Horton, Maxwell; Najibi, Mahyar; Nabi, Moin; Cho, Minsik; Naik, Devang
contents	Training large language models (LLMs) for different inference constraints is computationally expensive, limiting control over efficiency-accuracy trade-offs. Moreover, once trained, these models typically process tokens uniformly, regardless of their complexity, leading to static and inflexible behavior. In this paper, we introduce a post-training optimization framework, DynaMoE, that adapts a pre-trained dense LLM to a token-difficulty-driven Mixture-of-Experts model with minimal fine-tuning cost. This adaptation makes the model dynamic, with sensitivity control to customize the balance between efficiency and accuracy. DynaMoE features a token-difficulty-aware router that predicts the difficulty of tokens and directs them to the appropriate sub-networks or experts, enabling larger experts to handle more complex tokens and smaller experts to process simpler ones. Our experiments demonstrate that DynaMoE can generate a range of adaptive model variants of the existing trained LLM with a single fine-tuning step, utilizing only $10B$ tokens, a minimal cost compared to the base model's training. Each variant offers distinct trade-offs between accuracy and performance. Compared to the baseline post-training optimization framework, Flextron, our method achieves similar aggregated accuracy across downstream tasks, despite using only $\frac{1}{9}\text{th}$ of their fine-tuning cost.
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_12325
institution	arXiv
publishDate	2025
record_format	arxiv
title	From Dense to Dynamic: Token-Difficulty Driven MoEfication of Pre-Trained LLMs
topic	Computation and Language
url	https://arxiv.org/abs/2502.12325
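
Illustrative note: the contents field describes a router that predicts token difficulty and sends harder tokens to larger experts, with a sensitivity control that trades efficiency against accuracy. The sketch below shows that general idea only. It is not the DynaMoE implementation: the function names, the difficulty scores, the thresholds, and the relative expert costs are all hypothetical values chosen for illustration, and the record does not say how the paper computes or applies them.

```python
from typing import List, Sequence


def route_by_difficulty(difficulty: Sequence[float],
                        thresholds: Sequence[float]) -> List[int]:
    """Map each token's difficulty score to an expert index.

    Index 0 is the smallest expert; index len(thresholds) is the largest.
    Both the scores and the thresholds are hypothetical inputs here.
    """
    if list(thresholds) != sorted(thresholds):
        raise ValueError("thresholds must be sorted in ascending order")
    routes = []
    for score in difficulty:
        # The expert index is the number of thresholds the score reaches.
        routes.append(sum(1 for t in thresholds if score >= t))
    return routes


def average_cost(routes: Sequence[int], expert_costs: Sequence[float]) -> float:
    """Mean relative compute per token for a given routing decision."""
    if not routes:
        raise ValueError("routes must not be empty")
    return sum(expert_costs[r] for r in routes) / len(routes)


# Assumed per-token difficulty scores in [0, 1] (made up for this example).
scores = [0.05, 0.40, 0.90, 0.20]
# Assumed relative cost of the small, medium, and full-size experts.
expert_costs = [0.25, 0.5, 1.0]

# Higher thresholds favor efficiency; lower thresholds favor accuracy.
efficient = route_by_difficulty(scores, thresholds=[0.3, 0.7])
accurate = route_by_difficulty(scores, thresholds=[0.1, 0.3])

print(efficient, average_cost(efficient, expert_costs))
print(accurate, average_cost(accurate, expert_costs))
```

Lowering the thresholds pushes more tokens to the larger experts, which raises the average cost per token. That is one simple way to picture a sensitivity control, under the assumptions stated above.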
