Bibliographic Details
Main Authors: Wang, Jun, Yao, Yunxiang, Kuang, Wenwei, Mao, Runze, Sun, Zhenhao, Tao, Zhuang, Zhang, Ziyang, Li, Dengyu, Chen, Jiajun, Wang, Zhili, Cui, Kai, Cai, Congzhi, Lan, Longwen, Zhang, Ken
Format: Preprint
Published: 2025
Online Access: https://arxiv.org/abs/2511.22481
Table of Contents:
  • Large Language Models drive a wide range of modern AI applications, but serving them at scale is challenging because of intensive computation, strict latency constraints, and throughput bottlenecks. We introduce OmniInfer, a unified system-level acceleration framework designed to maximize end-to-end serving efficiency through fine-grained optimization of expert placement, cache compression, and scheduling. OmniInfer integrates three complementary components: OmniPlacement for load-aware Mixture-of-Experts scheduling, OmniAttn for sparse attention acceleration, and OmniProxy for disaggregation-aware request scheduling. Built atop vLLM, OmniInfer delivers system-wide performance gains through adaptive resource disaggregation, efficient sparsity exploitation, and global coordination across the prefill and decode phases. Evaluated on DeepSeek-R1 on a 10-node Ascend 910C cluster, OmniInfer achieves 616 QPM; the unified framework reduces time per output token (TPOT) by 36%, and adding OmniProxy on top further reduces time to first token (TTFT) by 38%. The project is open-sourced at [gitee.com/omniai/omniinfer](https://gitee.com/omniai/omniinfer).