Bibliographic Details
Main Authors: Nishu, Kumari, Mehta, Sachin, Abnar, Samira, Farajtabar, Mehrdad, Horton, Maxwell, Najibi, Mahyar, Nabi, Moin, Cho, Minsik, Naik, Devang
Format: Preprint
Published: 2025
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2502.12325
_version_ 1866916619026956288
author Nishu, Kumari
Mehta, Sachin
Abnar, Samira
Farajtabar, Mehrdad
Horton, Maxwell
Najibi, Mahyar
Nabi, Moin
Cho, Minsik
Naik, Devang
contents Training large language models (LLMs) for different inference constraints is computationally expensive, limiting control over efficiency-accuracy trade-offs. Moreover, once trained, these models typically process tokens uniformly, regardless of their complexity, leading to static and inflexible behavior. In this paper, we introduce a post-training optimization framework, DynaMoE, that adapts a pre-trained dense LLM to a token-difficulty-driven Mixture-of-Experts model with minimal fine-tuning cost. This adaptation makes the model dynamic, with sensitivity control to customize the balance between efficiency and accuracy. DynaMoE features a token-difficulty-aware router that predicts the difficulty of tokens and directs them to the appropriate sub-networks or experts, enabling larger experts to handle more complex tokens and smaller experts to process simpler ones. Our experiments demonstrate that DynaMoE can generate a range of adaptive model variants of the existing trained LLM with a single fine-tuning step, using only 10B tokens, a minimal cost compared to the base model's training. Each variant offers distinct trade-offs between accuracy and performance. Compared to the baseline post-training optimization framework, Flextron, our method achieves similar aggregated accuracy across downstream tasks, despite using only one-ninth of its fine-tuning cost.
format Preprint
id arxiv_https___arxiv_org_abs_2502_12325
institution arXiv
publishDate 2025
record_format arxiv
title From Dense to Dynamic: Token-Difficulty Driven MoEfication of Pre-Trained LLMs
topic Computation and Language
url https://arxiv.org/abs/2502.12325
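
The abstract describes a router that predicts token difficulty and sends harder tokens to a larger expert and easier tokens to a smaller one, with a sensitivity control for the efficiency-accuracy balance. The following is a minimal sketch of that idea only, not the DynaMoE implementation: the function names, the two-expert setup, the difficulty scores, and the use of a single threshold as the sensitivity control are all assumptions made for illustration.

```python
from typing import List, Sequence


def route_tokens(difficulty: Sequence[float], threshold: float) -> List[int]:
    """Assign each token to an expert by comparing its score to a threshold.

    Hypothetical convention: 0 = smaller expert, 1 = larger expert.
    `difficulty` stands in for the output of a learned difficulty predictor,
    and `threshold` stands in for the sensitivity control.
    """
    return [1 if score >= threshold else 0 for score in difficulty]


def large_expert_fraction(assignments: Sequence[int]) -> float:
    """Fraction of tokens sent to the larger expert (a rough proxy for compute)."""
    if not assignments:
        return 0.0
    return sum(assignments) / len(assignments)


if __name__ == "__main__":
    # Made-up difficulty scores for four tokens.
    scores = [0.05, 0.40, 0.90, 0.65]
    for threshold in (0.3, 0.6, 0.95):
        assignments = route_tokens(scores, threshold)
        print(threshold, assignments, large_expert_fraction(assignments))
```

In this sketch, raising the threshold routes fewer tokens to the larger expert, which lowers compute at a possible cost in accuracy; lowering it does the opposite.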