Staff View: Making MoE-based LLM Inference Resilient with Tarragon :: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Songyu; Tam, Aaron; Lee, Myungjin; Qi, Shixiong; Ramakrishnan, K. K.
Format:	Preprint
Published:	2026
Subjects:	Distributed, Parallel, and Cluster Computing; Machine Learning
Online Access:	https://arxiv.org/abs/2601.01310

_version_	1866908749428424704
author	Zhang, Songyu; Tam, Aaron; Lee, Myungjin; Qi, Shixiong; Ramakrishnan, K. K.
author_facet	Zhang, Songyu; Tam, Aaron; Lee, Myungjin; Qi, Shixiong; Ramakrishnan, K. K.
contents	Mixture-of-Experts (MoE) models are increasingly used to serve LLMs at scale, but failures become common as deployment scale grows. Existing systems exhibit poor failure resilience: even a single worker failure triggers a coarse-grained, service-wide restart, discarding accumulated progress and halting the entire inference pipeline during recovery, an approach clearly ill-suited for latency-sensitive LLM services. We present Tarragon, a resilient MoE inference framework that confines the impact of a failure to individual workers while allowing the rest of the pipeline to continue making forward progress. Tarragon exploits the natural separation between the attention and expert computation in MoE-based transformers, treating attention workers (AWs) and expert workers (EWs) as distinct failure domains. Tarragon introduces a reconfigurable datapath to mask failures by rerouting requests to healthy workers. On top of this datapath, Tarragon implements a self-healing mechanism that relaxes the tightly synchronized execution of existing MoE frameworks. For stateful AWs, Tarragon performs asynchronous, incremental KV cache checkpointing with per-request restoration, and for stateless EWs, it leverages residual GPU memory to deploy shadow experts. Together, these keep recovery cost and recomputation overhead extremely low. Our evaluation shows that, compared to the state-of-the-art MegaScale-Infer, Tarragon reduces failure-induced stalls by 160-213x (from ~64 s down to 0.3-0.4 s) while preserving performance when no failures occur.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_01310
institution	arXiv
publishDate	2026
record_format	arxiv
title	Making MoE-based LLM Inference Resilient with Tarragon
topic	Distributed, Parallel, and Cluster Computing; Machine Learning
url	https://arxiv.org/abs/2601.01310

Similar Items