| Main Authors: | Ikram, Azam; Li, Xiang; Elnikety, Sameh; Bagchi, Saurabh |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2504.20828 |
Similar Items
Analytically-Driven Resource Management for Cloud-Native Microservices
by: Zhang, Yanqi, et al.
Published: (2024)
Efficient LLM Serving on Hybrid Real-time and Best-effort Requests
by: Borui, Wan, et al.
Published: (2025)
StreamServe: Adaptive Speculative Flows for Low-Latency Disaggregated LLM Serving
by: Kumar, Satyam, et al.
Published: (2026)
BucketServe: Bucket-Based Dynamic Batching for Smart and Efficient LLM Inference Serving
by: Zheng, Wanyi, et al.
Published: (2025)
Hubs and Spokes Learning: Efficient and Scalable Collaborative Machine Learning
by: Sharma, Atul, et al.
Published: (2025)
StepCache: Step-Level Reuse with Lightweight Verification and Selective Patching for LLM Serving
by: Nouri, Azam
Published: (2026)
TinyServe: Query-Aware Cache Selection for Efficient LLM Serving
by: Liu, Dong, et al.
Published: (2025)
AutoFeedback: An LLM-based Framework for Efficient and Accurate API Request Generation
by: Liu, Huanxi, et al.
Published: (2024)
Beyond Corner Patches: Semantics-Aware Backdoor Attack in Federated Learning
by: Herath, Kavindu, et al.
Published: (2026)
Parrot: Efficient Serving of LLM-based Applications with Semantic Variable
by: Lin, Chaofan, et al.
Published: (2024)
CARE: Compatibility-Aware Incentive Mechanisms for Federated Learning with Budgeted Requesters
by: Liu, Xiang, et al.
Published: (2025)
KunServe: Parameter-centric Memory Management for Efficient Memory Overloading Handling in LLM Serving
by: Cheng, Rongxin, et al.
Published: (2024)
RAMP: Reinforcement Adaptive Mixed Precision Quantization for Efficient On Device LLM Inference
by: Gautam, Arpit Singh, et al.
Published: (2026)
Requesting Expert Reasoning: Augmenting LLM Agents with Learned Collaborative Intervention
by: Wang, Zhiming, et al.
Published: (2026)
Information Ecosystem Reengineering via Public Sector Knowledge Representation
by: Bagchi, Mayukh
Published: (2025)
Dynamic Vehicle Routing Problem with Prompt Confirmation of Advance Requests
by: Sivagnanam, Amutheezan, et al.
Published: (2026)
Causal Information Prioritization for Efficient Reinforcement Learning
by: Cao, Hongye, et al.
Published: (2025)
EVICPRESS: Joint KV-Cache Compression and Eviction for Efficient LLM Serving
by: Feng, Shaoting, et al.
Published: (2025)
REASONING COMPILER: LLM-Guided Optimizations for Efficient Model Serving
by: Tang, Annabelle Sujun, et al.
Published: (2025)
Efficient Serving of LLM Applications with Probabilistic Demand Modeling
by: Liu, Yifei, et al.
Published: (2025)
Efficient LLM Serving for Agentic Workflows: A Data Systems Perspective
by: Wadlom, Noppanat, et al.
Published: (2026)
VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?
by: Kamahori, Keisuke, et al.
Published: (2026)
SOMA: Efficient Multi-turn LLM Serving via Small Language Model
by: Cheng, Xueqi, et al.
Published: (2026)
Semi-Clairvoyant Scheduling of Speculative Decoding Requests to Minimize LLM Inference Latency
by: Li, Ruixiao, et al.
Published: (2025)
Nightjar: Dynamic Adaptive Speculative Decoding for Large Language Models Serving
by: Li, Rui, et al.
Published: (2025)
Scaling Graph Chain-of-Thought Reasoning: A Multi-Agent Framework with Efficient LLM Serving
by: Huan, Chengying, et al.
Published: (2025)
Efficient Training-Free Online Routing for High-Volume Multi-LLM Serving
by: Wu, Fangzhou, et al.
Published: (2025)
Taming the Titans: A Survey of Efficient LLM Inference Serving
by: Zhen, Ranran, et al.
Published: (2025)
Patchwork: A Unified Framework for RAG Serving
by: Hu, Bodun, et al.
Published: (2025)
Hierarchical Autoscaling for Large Language Model Serving with Chiron
by: Patke, Archit, et al.
Published: (2025)
Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving
by: Zhou, Qihui, et al.
Published: (2025)
SUN: Shared Use of Next-token Prediction for Efficient Multi-LLM Disaggregated Serving
by: Woo, Sunghyeon, et al.
Published: (2026)
Autellix: An Efficient Serving Engine for LLM Agents as General Programs
by: Luo, Michael, et al.
Published: (2025)
MoEless: Efficient MoE LLM Serving via Serverless Computing
by: Yu, Hanfei, et al.
Published: (2026)
A Generative AI-driven Metadata Modelling Approach
by: Bagchi, Mayukh
Published: (2024)
D$^{2}$MoE: Dual Routing and Dynamic Scheduling for Efficient On-Device MoE-based LLM Serving
by: Wang, Haodong, et al.
Published: (2025)
POLAR: Automating Cyber Threat Prioritization through LLM-Powered Assessment
by: Tang, Luoxi, et al.
Published: (2025)
Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving
by: Yu, Shan, et al.
Published: (2025)
SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference
by: Xie, Jincheng, et al.
Published: (2026)
Kelle: Co-design KV Caching and eDRAM for Efficient LLM Serving in Edge Computing
by: Xia, Tianhua, et al.
Published: (2025)