:: Library Catalog

Bibliographic Details
Main Authors:	Liu, Hao; Huang, Ye; Huang, Chenghuan; Zheng, Zhenyi; Du, Jiangsu; Ma, Ziyang; Lyu, Jing; Lu, Yutong
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2604.04451

Similar Items

EcoServe: Enabling Cost-effective LLM Serving with Proactive Intra- and Inter-Instance Orchestration
by: Du, Jiangsu, et al.
Published: (2025)

TD-Pipe: Temporally-Disaggregated Pipeline Parallelism Architecture for High-Throughput LLM Inference
by: Zhang, Hongbin, et al.
Published: (2025)

Ghidorah: Fast LLM Inference on Edge with Speculative Decoding and Hetero-Core Parallelism
by: Wei, Jinhui, et al.
Published: (2025)

Adaptive Hybrid Caching for Efficient Text-to-Video Diffusion Model Acceleration
by: Wei, Yuanxin, et al.
Published: (2025)

AlignedServe: Orchestrating Prefix-aware Batching to Build a High-throughput and Computing-efficient LLM Serving System
by: Bai, Fengyao, et al.
Published: (2026)

SRDiffusion: Accelerate Video Diffusion Inference via Sketching-Rendering Cooperation
by: Cheng, Shenggan, et al.
Published: (2025)

Galaxy: A Resource-Efficient Collaborative Edge AI System for In-situ Transformer Inference
by: Ye, Shengyuan, et al.
Published: (2024)

StepCache: Step-Level Reuse with Lightweight Verification and Selective Patching for LLM Serving
by: Nouri, Azam
Published: (2026)

Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving
by: Gao, Shihong, et al.
Published: (2025)

LLMCache: Layer-Wise Caching Strategies for Accelerated Reuse in Transformer Inference
by: Bansal, Harsh Vardhan
Published: (2025)

gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling
by: Guo, Tianyu, et al.
Published: (2025)

AirCache: Activating Inter-modal Relevancy KV Cache Compression for Efficient Large Vision-Language Model Inference
by: Huang, Kai, et al.
Published: (2025)

HarmoniCa: Harmonizing Training and Inference for Better Feature Caching in Diffusion Transformer Acceleration
by: Huang, Yushi, et al.
Published: (2024)

AB-Cache: Training-Free Acceleration of Diffusion Models via Adams-Bashforth Cached Feature Reuse
by: Yu, Zichao, et al.
Published: (2025)

Towards Efficient Multi-Scale Deformable Attention on NPU
by: Huang, Chenghuan, et al.
Published: (2025)

X-Cache: Cross-Chunk Block Caching for Few-Step Autoregressive World Models Inference
by: Zeng, Yixiao, et al.
Published: (2026)

Accelerating Diffusion Transformers with Token-wise Feature Caching
by: Zou, Chang, et al.
Published: (2024)

SCORPIO: Serving the Right Requests at the Right Time for Heterogeneous SLOs in LLM Inference
by: Tang, Yinghao, et al.
Published: (2025)

PCR: A Prefetch-Enhanced Cache Reuse System for Low-Latency RAG Serving
by: Wang, Wenfeng, et al.
Published: (2026)

CacheClip: Accelerating RAG with Effective KV Cache Reuse
by: Yang, Bin, et al.
Published: (2025)

Adaptive KV Cache Reuse for Fast Long-Context LLM Serving
by: Li, Fei, et al.
Published: (2026)

DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching
by: Zou, Chang, et al.
Published: (2026)

Model Reveals What to Cache: Profiling-Based Feature Reuse for Video Diffusion Models
by: Ma, Xuran, et al.
Published: (2025)

SpeCa: Accelerating Diffusion Transformers with Speculative Feature Caching
by: Liu, Jiacheng, et al.
Published: (2025)

SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers
by: Liu, Joseph, et al.
Published: (2024)

EFIM: Efficient Serving of LLMs for Infilling Tasks with Improved KV Cache Reuse
by: Guo, Tianyu, et al.
Published: (2025)

Accelerating Diffusion Transformer via Gradient-Optimized Cache
by: Qiu, Junxiang, et al.
Published: (2025)

Accelerating Diffusion Transformer via Error-Optimized Cache
by: Qiu, Junxiang, et al.
Published: (2025)

RelayCaching: Accelerating LLM Collaboration via Decoding KV Cache Reuse
by: Geng, Yingsheng, et al.
Published: (2026)

BWCache: Accelerating Video Diffusion Transformers through Block-Wise Caching
by: Cui, Hanshuai, et al.
Published: (2025)

IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse
by: Bai, Yushi, et al.
Published: (2026)

Token Caching for Diffusion Transformer Acceleration
by: Lou, Jinming, et al.
Published: (2024)

AugServe: Adaptive Request Scheduling for Augmented Large Language Model Inference Serving
by: Wang, Ying, et al.
Published: (2025)

From Reusing to Forecasting: Accelerating Diffusion Models with TaylorSeers
by: Liu, Jiacheng, et al.
Published: (2025)

Optimizing Few-Step Sampler for Diffusion Probabilistic Model
by: Huang, Jen-Yuan
Published: (2024)

Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching
by: Ma, Xinyin, et al.
Published: (2024)

OmniCache: A Trajectory-Oriented Global Perspective on Training-Free Cache Reuse for Diffusion Transformer Models
by: Chu, Huanpeng, et al.
Published: (2025)

Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching
by: Dong, Yanhao, et al.
Published: (2025)

Accelerating Diffusion-based Video Editing via Heterogeneous Caching: Beyond Full Computing at Sampled Denoising Timestep
by: Liu, Tianyi, et al.
Published: (2026)

Prompt Cache: Modular Attention Reuse for Low-Latency Inference
by: Gim, In, et al.
Published: (2023)