:: Library Catalog

Bibliographic Details
Main Authors:	Wan, Borui; Zhao, Juntao; Jiang, Chenyu; Guo, Chuanxiong; Wu, Chuan
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2504.09590

Similar Items

LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization
by: Zhao, Juntao, et al.
Published: (2024)

Sandwich: Joint Configuration Search and Hot-Switching for Efficient CPU LLM Serving
by: Zhao, Juntao, et al.
Published: (2025)

Ascendra: Dynamic Request Prioritization for Efficient LLM Serving
by: Ikram, Azam, et al.
Published: (2025)

Learned Best-Effort LLM Serving
by: Jha, Siddharth, et al.
Published: (2024)

Taming the Titans: A Survey of Efficient LLM Inference Serving
by: Zhen, Ranran, et al.
Published: (2025)

Combining the Best of Both Worlds: A Method for Hybrid NMT and LLM Translation
by: Wu, Zhanglin, et al.
Published: (2025)

Intelligence Foundation Model: A New Perspective to Approach Artificial General Intelligence
by: Cai, Borui, et al.
Published: (2025)

SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference
by: Xie, Jincheng, et al.
Published: (2026)

Sparse Prefix Caching for Hybrid and Recurrent LLM Serving
by: Shirokikh, Mikhail, et al.
Published: (2026)

Mining Intrinsic Rewards from LLM Hidden States for Efficient Best-of-N Sampling
by: Guo, Jizhou, et al.
Published: (2025)

QSpec: Speculative Decoding with Complementary Quantization Schemes
by: Zhao, Juntao, et al.
Published: (2024)

TinyServe: Query-Aware Cache Selection for Efficient LLM Serving
by: Liu, Dong, et al.
Published: (2025)

MegaScale-Data: Scaling Dataloader for Multisource Large Foundation Model Training
by: Zhao, Juntao, et al.
Published: (2025)

Efficient Training-Free Online Routing for High-Volume Multi-LLM Serving
by: Wu, Fangzhou, et al.
Published: (2025)

AutoFeedback: An LLM-based Framework for Efficient and Accurate API Request Generation
by: Liu, Huanxi, et al.
Published: (2024)

MTServe: Efficient Serving for Generative Recommendation Models with Hierarchical Caches
by: Wang, Xin, et al.
Published: (2026)

SOMA: Efficient Multi-turn LLM Serving via Small Language Model
by: Cheng, Xueqi, et al.
Published: (2026)

Efficient Serving of LLM Applications with Probabilistic Demand Modeling
by: Liu, Yifei, et al.
Published: (2025)

QSync: Quantization-Minimized Synchronous Distributed Training Across Hybrid Devices
by: Zhao, Juntao, et al.
Published: (2024)

BucketServe: Bucket-Based Dynamic Batching for Smart and Efficient LLM Inference Serving
by: Zheng, Wanyi, et al.
Published: (2025)

Parrot: Efficient Serving of LLM-based Applications with Semantic Variable
by: Lin, Chaofan, et al.
Published: (2024)

EVICPRESS: Joint KV-Cache Compression and Eviction for Efficient LLM Serving
by: Feng, Shaoting, et al.
Published: (2025)

KunServe: Parameter-centric Memory Management for Efficient Memory Overloading Handling in LLM Serving
by: Cheng, Rongxin, et al.
Published: (2024)

Requesting Expert Reasoning: Augmenting LLM Agents with Learned Collaborative Intervention
by: Wang, Zhiming, et al.
Published: (2026)

PersonalQ: Select, Quantize, and Serve Personalized Diffusion Models for Efficient Inference
by: Wang, Qirui, et al.
Published: (2026)

Zipage: Maintain High Request Concurrency for LLM Reasoning through Compressed PagedAttention
by: Liao, Mengqi, et al.
Published: (2026)

SlimPack: Fine-Grained Asymmetric Packing for Balanced and Efficient Variable-Length LLM Training
by: Liu, Yuliang, et al.
Published: (2025)

SelectiveShield: Lightweight Hybrid Defense Against Gradient Leakage in Federated Learning
by: Li, Borui, et al.
Published: (2025)

REASONING COMPILER: LLM-Guided Optimizations for Efficient Model Serving
by: Tang, Annabelle Sujun, et al.
Published: (2025)

LLM Inference Serving: Survey of Recent Advances and Opportunities
by: Li, Baolin, et al.
Published: (2024)

TinyFormer: Preserving Tiny Objects in YOLO-DETR Hybrid Real-time Detectors
by: Hsieh, Jun-Wei, et al.
Published: (2026)

MambaNeXt-YOLO: A Hybrid State Space Model for Real-time Object Detection
by: Lei, Xiaochun, et al.
Published: (2025)

Efficient LLM Serving for Agentic Workflows: A Data Systems Perspective
by: Wadlom, Noppanat, et al.
Published: (2026)

GhostServe: A Lightweight Checkpointing System in the Shadow for Fault-Tolerant LLM Serving
by: Jayakody, Shakya, et al.
Published: (2026)

EF-LLM: Energy Forecasting LLM with AI-assisted Automation, Enhanced Sparse Prediction, Hallucination Detection
by: Qiu, Zihang, et al.
Published: (2024)

Scaling Graph Chain-of-Thought Reasoning: A Multi-Agent Framework with Efficient LLM Serving
by: Huan, Chengying, et al.
Published: (2025)

Hybrid Inverse Reinforcement Learning
by: Ren, Juntao, et al.
Published: (2024)

Writing Like the Best: Exemplar-Based Expository Text Generation
by: Liu, Yuxiang, et al.
Published: (2025)

MIDAS: Multimodal Interactive Digital-humAn Synthesis via Real-time Autoregressive Video Generation
by: Chen, Ming, et al.
Published: (2025)

ByteCheckpoint: A Unified Checkpointing System for Large Foundation Model Development
by: Wan, Borui, et al.
Published: (2024)