| Main Authors: | Liu, Hao, Huang, Ye, Huang, Chenghuan, Zheng, Zhenyi, Du, Jiangsu, Ma, Ziyang, Lyu, Jing, Lu, Yutong |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2604.04451 |
Similar Items
EcoServe: Enabling Cost-effective LLM Serving with Proactive Intra- and Inter-Instance Orchestration
by: Du, Jiangsu, et al.
Published: (2025)
TD-Pipe: Temporally-Disaggregated Pipeline Parallelism Architecture for High-Throughput LLM Inference
by: Zhang, Hongbin, et al.
Published: (2025)
Ghidorah: Fast LLM Inference on Edge with Speculative Decoding and Hetero-Core Parallelism
by: Wei, Jinhui, et al.
Published: (2025)
Adaptive Hybrid Caching for Efficient Text-to-Video Diffusion Model Acceleration
by: Wei, Yuanxin, et al.
Published: (2025)
AlignedServe: Orchestrating Prefix-aware Batching to Build a High-throughput and Computing-efficient LLM Serving System
by: Bai, Fengyao, et al.
Published: (2026)
SRDiffusion: Accelerate Video Diffusion Inference via Sketching-Rendering Cooperation
by: Cheng, Shenggan, et al.
Published: (2025)
Galaxy: A Resource-Efficient Collaborative Edge AI System for In-situ Transformer Inference
by: Ye, Shengyuan, et al.
Published: (2024)
StepCache: Step-Level Reuse with Lightweight Verification and Selective Patching for LLM Serving
by: Nouri, Azam
Published: (2026)
Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving
by: Gao, Shihong, et al.
Published: (2025)
LLMCache: Layer-Wise Caching Strategies for Accelerated Reuse in Transformer Inference
by: Bansal, Harsh Vardhan
Published: (2025)
gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling
by: Guo, Tianyu, et al.
Published: (2025)
AirCache: Activating Inter-modal Relevancy KV Cache Compression for Efficient Large Vision-Language Model Inference
by: Huang, Kai, et al.
Published: (2025)
HarmoniCa: Harmonizing Training and Inference for Better Feature Caching in Diffusion Transformer Acceleration
by: Huang, Yushi, et al.
Published: (2024)
AB-Cache: Training-Free Acceleration of Diffusion Models via Adams-Bashforth Cached Feature Reuse
by: Yu, Zichao, et al.
Published: (2025)
Towards Efficient Multi-Scale Deformable Attention on NPU
by: Huang, Chenghuan, et al.
Published: (2025)
X-Cache: Cross-Chunk Block Caching for Few-Step Autoregressive World Models Inference
by: Zeng, Yixiao, et al.
Published: (2026)
Accelerating Diffusion Transformers with Token-wise Feature Caching
by: Zou, Chang, et al.
Published: (2024)
SCORPIO: Serving the Right Requests at the Right Time for Heterogeneous SLOs in LLM Inference
by: Tang, Yinghao, et al.
Published: (2025)
PCR: A Prefetch-Enhanced Cache Reuse System for Low-Latency RAG Serving
by: Wang, Wenfeng, et al.
Published: (2026)
CacheClip: Accelerating RAG with Effective KV Cache Reuse
by: Yang, Bin, et al.
Published: (2025)
Adaptive KV Cache Reuse for Fast Long-Context LLM Serving
by: Li, Fei, et al.
Published: (2026)
DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching
by: Zou, Chang, et al.
Published: (2026)
Model Reveals What to Cache: Profiling-Based Feature Reuse for Video Diffusion Models
by: Ma, Xuran, et al.
Published: (2025)
SpeCa: Accelerating Diffusion Transformers with Speculative Feature Caching
by: Liu, Jiacheng, et al.
Published: (2025)
SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers
by: Liu, Joseph, et al.
Published: (2024)
EFIM: Efficient Serving of LLMs for Infilling Tasks with Improved KV Cache Reuse
by: Guo, Tianyu, et al.
Published: (2025)
Accelerating Diffusion Transformer via Gradient-Optimized Cache
by: Qiu, Junxiang, et al.
Published: (2025)
Accelerating Diffusion Transformer via Error-Optimized Cache
by: Qiu, Junxiang, et al.
Published: (2025)
RelayCaching: Accelerating LLM Collaboration via Decoding KV Cache Reuse
by: Geng, Yingsheng, et al.
Published: (2026)
BWCache: Accelerating Video Diffusion Transformers through Block-Wise Caching
by: Cui, Hanshuai, et al.
Published: (2025)
IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse
by: Bai, Yushi, et al.
Published: (2026)
Token Caching for Diffusion Transformer Acceleration
by: Lou, Jinming, et al.
Published: (2024)
AugServe: Adaptive Request Scheduling for Augmented Large Language Model Inference Serving
by: Wang, Ying, et al.
Published: (2025)
From Reusing to Forecasting: Accelerating Diffusion Models with TaylorSeers
by: Liu, Jiacheng, et al.
Published: (2025)
Optimizing Few-Step Sampler for Diffusion Probabilistic Model
by: Huang, Jen-Yuan
Published: (2024)
Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching
by: Ma, Xinyin, et al.
Published: (2024)
OmniCache: A Trajectory-Oriented Global Perspective on Training-Free Cache Reuse for Diffusion Transformer Models
by: Chu, Huanpeng, et al.
Published: (2025)
Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching
by: Dong, Yanhao, et al.
Published: (2025)
Accelerating Diffusion-based Video Editing via Heterogeneous Caching: Beyond Full Computing at Sampled Denoising Timestep
by: Liu, Tianyi, et al.
Published: (2026)
Prompt Cache: Modular Attention Reuse for Low-Latency Inference
by: Gim, In, et al.
Published: (2023)