:: Library Catalog

Bibliographic Details
Main Authors:	Ikram, Azam; Li, Xiang; Elnikety, Sameh; Bagchi, Saurabh
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2504.20828

Similar Items

Analytically-Driven Resource Management for Cloud-Native Microservices
by: Zhang, Yanqi, et al.
Published: (2024)

Efficient LLM Serving on Hybrid Real-time and Best-effort Requests
by: Borui, Wan, et al.
Published: (2025)

StreamServe: Adaptive Speculative Flows for Low-Latency Disaggregated LLM Serving
by: Kumar, Satyam, et al.
Published: (2026)

BucketServe: Bucket-Based Dynamic Batching for Smart and Efficient LLM Inference Serving
by: Zheng, Wanyi, et al.
Published: (2025)

Hubs and Spokes Learning: Efficient and Scalable Collaborative Machine Learning
by: Sharma, Atul, et al.
Published: (2025)

StepCache: Step-Level Reuse with Lightweight Verification and Selective Patching for LLM Serving
by: Nouri, Azam
Published: (2026)

TinyServe: Query-Aware Cache Selection for Efficient LLM Serving
by: Liu, Dong, et al.
Published: (2025)

AutoFeedback: An LLM-based Framework for Efficient and Accurate API Request Generation
by: Liu, Huanxi, et al.
Published: (2024)

Beyond Corner Patches: Semantics-Aware Backdoor Attack in Federated Learning
by: Herath, Kavindu, et al.
Published: (2026)

Parrot: Efficient Serving of LLM-based Applications with Semantic Variable
by: Lin, Chaofan, et al.
Published: (2024)

CARE: Compatibility-Aware Incentive Mechanisms for Federated Learning with Budgeted Requesters
by: Liu, Xiang, et al.
Published: (2025)

KunServe: Parameter-centric Memory Management for Efficient Memory Overloading Handling in LLM Serving
by: Cheng, Rongxin, et al.
Published: (2024)

RAMP: Reinforcement Adaptive Mixed Precision Quantization for Efficient On Device LLM Inference
by: Gautam, Arpit Singh, et al.
Published: (2026)

Requesting Expert Reasoning: Augmenting LLM Agents with Learned Collaborative Intervention
by: Wang, Zhiming, et al.
Published: (2026)

Information Ecosystem Reengineering via Public Sector Knowledge Representation
by: Bagchi, Mayukh
Published: (2025)

Dynamic Vehicle Routing Problem with Prompt Confirmation of Advance Requests
by: Sivagnanam, Amutheezan, et al.
Published: (2026)

Causal Information Prioritization for Efficient Reinforcement Learning
by: Cao, Hongye, et al.
Published: (2025)

EVICPRESS: Joint KV-Cache Compression and Eviction for Efficient LLM Serving
by: Feng, Shaoting, et al.
Published: (2025)

REASONING COMPILER: LLM-Guided Optimizations for Efficient Model Serving
by: Tang, Annabelle Sujun, et al.
Published: (2025)

Efficient Serving of LLM Applications with Probabilistic Demand Modeling
by: Liu, Yifei, et al.
Published: (2025)

Efficient LLM Serving for Agentic Workflows: A Data Systems Perspective
by: Wadlom, Noppanat, et al.
Published: (2026)

VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?
by: Kamahori, Keisuke, et al.
Published: (2026)

SOMA: Efficient Multi-turn LLM Serving via Small Language Model
by: Cheng, Xueqi, et al.
Published: (2026)

Semi-Clairvoyant Scheduling of Speculative Decoding Requests to Minimize LLM Inference Latency
by: Li, Ruixiao, et al.
Published: (2025)

Nightjar: Dynamic Adaptive Speculative Decoding for Large Language Models Serving
by: Li, Rui, et al.
Published: (2025)

Scaling Graph Chain-of-Thought Reasoning: A Multi-Agent Framework with Efficient LLM Serving
by: Huan, Chengying, et al.
Published: (2025)

Efficient Training-Free Online Routing for High-Volume Multi-LLM Serving
by: Wu, Fangzhou, et al.
Published: (2025)

Taming the Titans: A Survey of Efficient LLM Inference Serving
by: Zhen, Ranran, et al.
Published: (2025)

Patchwork: A Unified Framework for RAG Serving
by: Hu, Bodun, et al.
Published: (2025)

Hierarchical Autoscaling for Large Language Model Serving with Chiron
by: Patke, Archit, et al.
Published: (2025)

Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving
by: Zhou, Qihui, et al.
Published: (2025)

SUN: Shared Use of Next-token Prediction for Efficient Multi-LLM Disaggregated Serving
by: Woo, Sunghyeon, et al.
Published: (2026)

Autellix: An Efficient Serving Engine for LLM Agents as General Programs
by: Luo, Michael, et al.
Published: (2025)

MoEless: Efficient MoE LLM Serving via Serverless Computing
by: Yu, Hanfei, et al.
Published: (2026)

A Generative AI-driven Metadata Modelling Approach
by: Bagchi, Mayukh
Published: (2024)

D$^{2}$MoE: Dual Routing and Dynamic Scheduling for Efficient On-Device MoE-based LLM Serving
by: Wang, Haodong, et al.
Published: (2025)

POLAR: Automating Cyber Threat Prioritization through LLM-Powered Assessment
by: Tang, Luoxi, et al.
Published: (2025)

Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving
by: Yu, Shan, et al.
Published: (2025)

SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference
by: Xie, Jincheng, et al.
Published: (2026)

Kelle: Co-design KV Caching and eDRAM for Efficient LLM Serving in Edge Computing
by: Xia, Tianhua, et al.
Published: (2025)