| Main Authors: | Wan, Borui; Zhao, Juntao; Jiang, Chenyu; Guo, Chuanxiong; Wu, Chuan |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2504.09590 |
Similar Items
LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization
by: Zhao, Juntao, et al.
Published: (2024)
Sandwich: Joint Configuration Search and Hot-Switching for Efficient CPU LLM Serving
by: Zhao, Juntao, et al.
Published: (2025)
Ascendra: Dynamic Request Prioritization for Efficient LLM Serving
by: Ikram, Azam, et al.
Published: (2025)
Learned Best-Effort LLM Serving
by: Jha, Siddharth, et al.
Published: (2024)
Taming the Titans: A Survey of Efficient LLM Inference Serving
by: Zhen, Ranran, et al.
Published: (2025)
Combining the Best of Both Worlds: A Method for Hybrid NMT and LLM Translation
by: Wu, Zhanglin, et al.
Published: (2025)
Intelligence Foundation Model: A New Perspective to Approach Artificial General Intelligence
by: Cai, Borui, et al.
Published: (2025)
SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference
by: Xie, Jincheng, et al.
Published: (2026)
Sparse Prefix Caching for Hybrid and Recurrent LLM Serving
by: Shirokikh, Mikhail, et al.
Published: (2026)
Mining Intrinsic Rewards from LLM Hidden States for Efficient Best-of-N Sampling
by: Guo, Jizhou, et al.
Published: (2025)
QSpec: Speculative Decoding with Complementary Quantization Schemes
by: Zhao, Juntao, et al.
Published: (2024)
TinyServe: Query-Aware Cache Selection for Efficient LLM Serving
by: Liu, Dong, et al.
Published: (2025)
MegaScale-Data: Scaling Dataloader for Multisource Large Foundation Model Training
by: Zhao, Juntao, et al.
Published: (2025)
Efficient Training-Free Online Routing for High-Volume Multi-LLM Serving
by: Wu, Fangzhou, et al.
Published: (2025)
AutoFeedback: An LLM-based Framework for Efficient and Accurate API Request Generation
by: Liu, Huanxi, et al.
Published: (2024)
MTServe: Efficient Serving for Generative Recommendation Models with Hierarchical Caches
by: Wang, Xin, et al.
Published: (2026)
SOMA: Efficient Multi-turn LLM Serving via Small Language Model
by: Cheng, Xueqi, et al.
Published: (2026)
Efficient Serving of LLM Applications with Probabilistic Demand Modeling
by: Liu, Yifei, et al.
Published: (2025)
QSync: Quantization-Minimized Synchronous Distributed Training Across Hybrid Devices
by: Zhao, Juntao, et al.
Published: (2024)
BucketServe: Bucket-Based Dynamic Batching for Smart and Efficient LLM Inference Serving
by: Zheng, Wanyi, et al.
Published: (2025)
Parrot: Efficient Serving of LLM-based Applications with Semantic Variable
by: Lin, Chaofan, et al.
Published: (2024)
EVICPRESS: Joint KV-Cache Compression and Eviction for Efficient LLM Serving
by: Feng, Shaoting, et al.
Published: (2025)
KunServe: Parameter-centric Memory Management for Efficient Memory Overloading Handling in LLM Serving
by: Cheng, Rongxin, et al.
Published: (2024)
Requesting Expert Reasoning: Augmenting LLM Agents with Learned Collaborative Intervention
by: Wang, Zhiming, et al.
Published: (2026)
PersonalQ: Select, Quantize, and Serve Personalized Diffusion Models for Efficient Inference
by: Wang, Qirui, et al.
Published: (2026)
Zipage: Maintain High Request Concurrency for LLM Reasoning through Compressed PagedAttention
by: Liao, Mengqi, et al.
Published: (2026)
SlimPack: Fine-Grained Asymmetric Packing for Balanced and Efficient Variable-Length LLM Training
by: Liu, Yuliang, et al.
Published: (2025)
SelectiveShield: Lightweight Hybrid Defense Against Gradient Leakage in Federated Learning
by: Li, Borui, et al.
Published: (2025)
REASONING COMPILER: LLM-Guided Optimizations for Efficient Model Serving
by: Tang, Annabelle Sujun, et al.
Published: (2025)
LLM Inference Serving: Survey of Recent Advances and Opportunities
by: Li, Baolin, et al.
Published: (2024)
TinyFormer: Preserving Tiny Objects in YOLO-DETR Hybrid Real-time Detectors
by: Hsieh, Jun-Wei, et al.
Published: (2026)
MambaNeXt-YOLO: A Hybrid State Space Model for Real-time Object Detection
by: Lei, Xiaochun, et al.
Published: (2025)
Efficient LLM Serving for Agentic Workflows: A Data Systems Perspective
by: Wadlom, Noppanat, et al.
Published: (2026)
GhostServe: A Lightweight Checkpointing System in the Shadow for Fault-Tolerant LLM Serving
by: Jayakody, Shakya, et al.
Published: (2026)
EF-LLM: Energy Forecasting LLM with AI-assisted Automation, Enhanced Sparse Prediction, Hallucination Detection
by: Qiu, Zihang, et al.
Published: (2024)
Scaling Graph Chain-of-Thought Reasoning: A Multi-Agent Framework with Efficient LLM Serving
by: Huan, Chengying, et al.
Published: (2025)
Hybrid Inverse Reinforcement Learning
by: Ren, Juntao, et al.
Published: (2024)
Writing Like the Best: Exemplar-Based Expository Text Generation
by: Liu, Yuxiang, et al.
Published: (2025)
MIDAS: Multimodal Interactive Digital-humAn Synthesis via Real-time Autoregressive Video Generation
by: Chen, Ming, et al.
Published: (2025)
ByteCheckpoint: A Unified Checkpointing System for Large Foundation Model Development
by: Wan, Borui, et al.
Published: (2024)