| Main Authors: | Gao, Bin, He, Zhuomin, Sharma, Puru, Kang, Qingxuan, Jevdjic, Djordje, Deng, Junbo, Yang, Xingkun, Yu, Zhou, Zuo, Pengfei |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2403.19708 |
Similar Items
numaPTE: Managing Page-Tables and TLBs on NUMA Systems
by: Gao, Bin, et al.
Published: (2024)
Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving
by: Zhou, Qihui, et al.
Published: (2025)
Continuous Semantic Caching for Low-Cost LLM Serving
by: Atalar, Baran, et al.
Published: (2026)
AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference
by: He, Zhuomin, et al.
Published: (2025)
SparseServe: Unlocking Parallelism for Dynamic Sparse Attention in Long-Context LLM Serving
by: Zhou, Qihui, et al.
Published: (2025)
InstCache: A Predictive Cache for LLM Serving
by: Zou, Longwei, et al.
Published: (2024)
DualMap: Enabling Both Cache Affinity and Load Balancing for Distributed LLM Serving
by: Yuan, Ying, et al.
Published: (2026)
IC-Cache: Efficient Large Language Model Serving via In-context Caching
by: Yu, Yifan, et al.
Published: (2025)
Semantic Caching for Low-Cost LLM Serving: From Offline Learning to Online Adaptation
by: Liu, Xutong, et al.
Published: (2025)
Cost-Efficient LLM Serving in the Cloud: VM Selection with KV Cache Offloading
by: Kim, Kihyun, et al.
Published: (2025)
Bipartite Graph Attention-based Clustering for Large-scale scRNA-seq Data
by: Liang, Zhuomin, et al.
Published: (2026)
SOMA: Efficient Multi-turn LLM Serving via Small Language Model
by: Cheng, Xueqi, et al.
Published: (2026)
On the Multi-turn Instruction Following for Conversational Web Agents
by: Deng, Yang, et al.
Published: (2024)
EPIC: Efficient Position-Independent Caching for Serving Large Language Models
by: Hu, Junhao, et al.
Published: (2024)
TinyServe: Query-Aware Cache Selection for Efficient LLM Serving
by: Liu, Dong, et al.
Published: (2025)
Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation
by: Liang, Yunkai, et al.
Published: (2025)
Mell: Memory-Efficient Large Language Model Serving via Multi-GPU KV Cache Management
by: Qianli, Liu, et al.
Published: (2025)
Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving
by: Gao, Wei, et al.
Published: (2025)
Multi-turn Consistent Image Editing
by: Zhou, Zijun, et al.
Published: (2025)
Toward Cost-Efficient Serving of Mixture-of-Experts with Asynchrony
by: Wang, Shaoyu, et al.
Published: (2025)
Not All Prefills Are Equal: PPD Disaggregation for Multi-turn LLM Serving
by: Li, Zongze, et al.
Published: (2026)
$A^3$: Attention-Aware Accurate KV Cache Fusion for Fast Large Language Model Serving
by: Zhou, Yuechi, et al.
Published: (2025)
Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving
by: Gao, Shihong, et al.
Published: (2025)
CacheFlow: Efficient LLM Serving with 3D-Parallel KV Cache Restoration
by: Nian, Sean, et al.
Published: (2026)
Arabic Multi-turn Conversational Synthetic Dataset
by: Misbah, Ahmed, et al.
Published: (2025)
CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion
by: Yao, Jiayi, et al.
Published: (2024)
ReasonCache: Accelerating Large Reasoning Model Serving through KV Cache Sharing
by: Chen, Kaiwen, et al.
Published: (2025)
CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving
by: Liu, Yuhan, et al.
Published: (2023)
AttnCache: Accelerating Self-Attention Inference for LLM Prefill via Attention Cache
by: Song, Dinghong, et al.
Published: (2025)
(110) Facet of MgTe Zinc Blende Semiconductor: A Holy Grail for Modern Spintronics
by: Mohanta, Manish Kumar, et al.
Published: (2023)
Investigating Quantum Spin-Textures Using Universal MJ Hamiltonians
by: Mohanta, Manish Kumar, et al.
Published: (2024)
Serving Large Language Models on Huawei CloudMatrix384
by: Zuo, Pengfei, et al.
Published: (2025)
MTServe: Efficient Serving for Generative Recommendation Models with Hierarchical Caches
by: Wang, Xin, et al.
Published: (2026)
MEPIC: Memory Efficient Position Independent Caching for LLM Serving
by: Wang, Qian, et al.
Published: (2025)
Efficient Multi-round LLM Inference over Disaggregated Serving
by: He, Wenhao, et al.
Published: (2026)
Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving
by: Yu, Shan, et al.
Published: (2025)
Cache Your Prompt When It's Green: Carbon-Aware Caching for Large Language Model Serving
by: Tian, Yuyang, et al.
Published: (2025)
Broaden your SCOPE! Efficient Multi-turn Conversation Planning for LLMs with Semantic Space
by: Chen, Zhiliang, et al.
Published: (2025)
Improving the Serving Performance of Multi-LoRA Large Language Models via Efficient LoRA and KV Cache Management
by: Zhang, Hang, et al.
Published: (2025)
Source-primed Multi-turn Conversation Helps Large Language Models Translate Documents
by: Hu, Hanxu, et al.
Published: (2025)