| Main Authors: | Ren, Feng, Qin, Ruoyu, Ma, Teng, Cai, Shangming, Liu, Zheng, Lei, Chao, Zhu, Dejiang, Yang, Ke, Li, Zheming, Cui, Jialei, Huang, Weixiao, Zhao, Yikai, Zhang, Yineng, Wu, Hao, Gao, Xiang, Fu, Yuhao, Jiang, Jinlei, Wu, Yongwei, Zhang, Mingxing |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2604.00368 |
Similar Items
Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
by: Qin, Ruoyu, et al.
Published: (2024)
Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation
by: Chen, Shaoyuan, et al.
Published: (2024)
Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning
by: Qin, Ruoyu, et al.
Published: (2025)
Not All Prefills Are Equal: PPD Disaggregation for Multi-turn LLM Serving
by: Li, Zongze, et al.
Published: (2026)
Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter
by: Qin, Ruoyu, et al.
Published: (2026)
Efficient Graph-Based Approximate Nearest Neighbor Search Achieving Low Latency Without Throughput Loss
by: Luo, Jingjia, et al.
Published: (2025)
HydraInfer: Hybrid Disaggregated Scheduling for Multimodal Large Language Model Serving
by: Dong, Xianzhe, et al.
Published: (2025)
Physical parameter regression from black hole images via a multiscale adaptive neural network
by: Wei, Jialei, et al.
Published: (2025)
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
by: Ye, Zihao, et al.
Published: (2025)
DynaServe: Unified and Elastic Execution for Dynamic Disaggregated LLM Serving
by: Ruan, Chaoyi, et al.
Published: (2025)
P/D-Serve: Serving Disaggregated Large Language Model at Scale
by: Jin, Yibo, et al.
Published: (2024)
BestServe: Serving Strategies with Optimal Goodput in Collocation and Disaggregation Architectures
by: Hu, Xiannan, et al.
Published: (2025)
TrEnv-X: Transparently Share Serverless Execution Environments Across Different Functions and Nodes
by: Huang, Jialiang, et al.
Published: (2025)
SAEC: Scene-Aware Enhanced Edge-Cloud Collaborative Industrial Vision Inspection with Multimodal LLM
by: Tian, Yuhao, et al.
Published: (2025)
Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving
by: Cheng, Ke, et al.
Published: (2024)
DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
by: Zhong, Yinmin, et al.
Published: (2024)
Trinity: Disaggregating Vector Search from Prefill-Decode Disaggregation in LLM Serving
by: Liu, Yi, et al.
Published: (2025)
ModServe: Modality- and Stage-Aware Resource Disaggregation for Scalable Multimodal Model Serving
by: Qiu, Haoran, et al.
Published: (2025)
End Khovanov homology and exotic Lagrangian planes
by: Teng, Yikai
Published: (2025)
MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism
by: Zhu, Ruidong, et al.
Published: (2025)
Efficiently Serving Large Multimodal Models Using EPD Disaggregation
by: Singh, Gursimran, et al.
Published: (2024)
BanaServe: Unified KV Cache and Dynamic Module Migration for Balancing Disaggregated LLM Serving in AI Infrastructure
by: He, Yiyuan, et al.
Published: (2025)
BanaServe: Unified KV Cache and Dynamic Module Migration for Balancing Disaggregated LLM Serving in AI Infrastructure
by: Yiyuan He, et al.
Published: (2026)
Uncertainty Quantification and Flow Dynamics in Rotating Detonation Engines
by: Kumar, Vinay, et al.
Published: (2025)
semi-PD: Towards Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage
by: Hong, Ke, et al.
Published: (2025)
GreenLLM: Disaggregating Large Language Model Serving on Heterogeneous GPUs for Lower Carbon Emissions
by: Shi, Tianyao, et al.
Published: (2024)
MCN-CL: Multimodal Cross-Attention Network and Contrastive Learning for Multimodal Emotion Recognition
by: Li, Feng, et al.
Published: (2025)
MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool
by: Hu, Cunchen, et al.
Published: (2024)
StreamServe: Adaptive Speculative Flows for Low-Latency Disaggregated LLM Serving
by: Kumar, Satyam, et al.
Published: (2026)
TENT5/FAM46: An Enigmatic Family of Secretory Tuners
by: Daniel Lacidogna, et al.
Published: (2025)
The Effects of Circadian Rhythms and Exercise Preconditioning on Cardiac Troponin T Levels Following Graded Exercise
by: Jinlei Nie, et al.
Published: (2025)
How Far Can Disaggregation Go? A Design-Space Exploration of Attention-FFN Disaggregation for Efficient MoE LLM Serving
by: Wu, Hanjiang, et al.
Published: (2026)
Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference
by: Sun, Xun, et al.
Published: (2026)
EPD-Serve: A Flexible Multimodal EPD Disaggregation Inference Serving System On Ascend
by: Bai, Fan, et al.
Published: (2026)
Arrow: Adaptive Scheduling Mechanisms for Disaggregated LLM Inference Architecture
by: Wu, Yu, et al.
Published: (2025)
TokenScale: Timely and Accurate Autoscaling for Disaggregated LLM Serving with Token Velocity
by: Lai, Ruiqi, et al.
Published: (2025)
OBSERVATIONS ON TENT-USING IN THE CAROLLIINE BAT RHINOPHYLLA PUMILIO IN SOUTHEASTERN BRAZIL
by: Zortéa, Marlon
Published: (1995)
Observation, Not Prediction: Conversation-Level Disaggregated Scheduling for Agentic Serving
by: Ding, Jianru, et al.
Published: (2026)
Efficient Multi-round LLM Inference over Disaggregated Serving
by: He, Wenhao, et al.
Published: (2026)
Recursive Offloading for LLM Serving in Multi-tier Networks
by: Wu, Zhiyuan, et al.
Published: (2025)