Bibliographic Details
Main Authors: Wang, Jun; Yao, Yunxiang; Kuang, Wenwei; Mao, Runze; Sun, Zhenhao; Tao, Zhuang; Zhang, Ziyang; Li, Dengyu; Chen, Jiajun; Wang, Zhili; Cui, Kai; Cai, Congzhi; Lan, Longwen; Zhang, Ken
Format: Preprint
Published: 2025
Subjects: Distributed, Parallel, and Cluster Computing
Online Access: https://arxiv.org/abs/2511.22481
author Wang, Jun
Yao, Yunxiang
Kuang, Wenwei
Mao, Runze
Sun, Zhenhao
Tao, Zhuang
Zhang, Ziyang
Li, Dengyu
Chen, Jiajun
Wang, Zhili
Cui, Kai
Cai, Congzhi
Lan, Longwen
Zhang, Ken
contents Large Language Models drive a wide range of modern AI applications but impose substantial challenges on large-scale serving systems due to intensive computation, strict latency constraints, and throughput bottlenecks. We introduce OmniInfer, a unified system-level acceleration framework designed to maximize end-to-end serving efficiency through fine-grained optimization of expert placement, cache compression, and scheduling. OmniInfer integrates three complementary components: OmniPlacement for load-aware Mixture-of-Experts scheduling, OmniAttn for sparse attention acceleration, and OmniProxy for disaggregation-aware request scheduling. Built atop vLLM, OmniInfer delivers system-wide performance gains through adaptive resource disaggregation, efficient sparsity exploitation, and global coordination across prefill and decode phases. Evaluated on DeepSeek-R1 within a 10-node Ascend 910C cluster, OmniInfer achieves 616 QPM: the unified framework reduces time per output token (TPOT) by 36%, and adding OmniProxy further reduces time to first token (TTFT) by 38%. The project is open-sourced at https://gitee.com/omniai/omniinfer.
format Preprint
id arxiv_https___arxiv_org_abs_2511_22481
institution arXiv
publishDate 2025
record_format arxiv
title OmniInfer: System-Wide Acceleration Techniques for Optimizing LLM Serving Throughput and Latency
topic Distributed, Parallel, and Cluster Computing
url https://arxiv.org/abs/2511.22481