| Main Authors: | Zhou, Zhongchun, Lai, Chengtao, Gu, Yuhang, Zhang, Wei |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2512.07312 |
Similar Items
LLaMCAT: Optimizing Large Language Model Inference with Cache Arbitration and Throttling
by: Zhou, Zhongchun, et al.
Published: (2025)
PiKV: KV Cache Management System for Mixture of Experts
by: Liu, Dong, et al.
Published: (2025)
Tangram: Accelerating Serverless LLM Loading through GPU Memory Reuse and Affinity
by: Zhu, Wenbin, et al.
Published: (2025)
PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving
by: Yüzügüler, Ahmet Caner, et al.
Published: (2025)
The DMA Streaming Framework: Kernel-Level Buffer Orchestration for High-Performance AI Data Paths
by: Graziano, Marco
Published: (2026)
DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency
by: Stojkovic, Jovan, et al.
Published: (2024)
ARKV: Adaptive and Resource-Efficient KV Cache Management under Limited Memory Budget for Long-Context Inference in LLMs
by: Lei, Jianlong, et al.
Published: (2026)
Investigating Memory Failure Prediction Across CPU Architectures
by: Yu, Qiao, et al.
Published: (2024)
SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators
by: Li, Jonathan, et al.
Published: (2025)
Cloud to Edge: Benchmarking LLM Inference On Hardware-Accelerated Single-Board Computers
by: Renney, Harri, et al.
Published: (2026)
Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
by: Qin, Ruoyu, et al.
Published: (2024)
Efficient Edge AI: Deploying Convolutional Neural Networks on FPGA with the Gemmini Accelerator
by: Peccia, Federico Nicolas, et al.
Published: (2024)
HyperOffload: Graph-Driven Hierarchical Memory Management for Large Language Models on SuperNode Architectures
by: Liu, Fangxin, et al.
Published: (2026)
PhD Thesis Summary: Methods for Reliability Assessment and Enhancement of Deep Neural Network Hardware Accelerators
by: Taheri, Mahdi
Published: (2026)
EdgeReasoning: Characterizing Reasoning LLM Deployment on Edge GPUs
by: Kubwimana, Benjamin, et al.
Published: (2025)
Adaptive Multi-Objective Tiered Storage Configuration for KV Cache in LLM Service
by: Zheng, Xianzhe, et al.
Published: (2026)
RedFuser: An Automatic Operator Fusion Framework for Cascaded Reductions on AI Accelerators
by: Tang, Xinsheng, et al.
Published: (2026)
Improving AI Efficiency in Data Centres by Power Dynamic Response
by: Marinoni, Andrea, et al.
Published: (2025)
A Scalable NorthPole System with End-to-End Vertical Integration for Low-Latency and Energy-Efficient LLM Inference
by: DeBole, Michael V., et al.
Published: (2025)
Intent-Driven Storage Systems: From Low-Level Tuning to High-Level Understanding
by: Bergman, Shai, et al.
Published: (2025)
NPU Design for Diffusion Language Model Inference
by: Lou, Binglei, et al.
Published: (2026)
FengHuang: Next-Generation Memory Orchestration for AI Inferencing
by: Li, Jiamin, et al.
Published: (2025)
Adaptive KV Cache Reuse for Fast Long-Context LLM Serving
by: Li, Fei, et al.
Published: (2026)
Accelerating MoE with Dynamic In-Switch Computing on Multi-GPUs
by: Zhang, Qijun, et al.
Published: (2026)
Rearchitecting Datacenter Lifecycle for AI: A TCO-Driven Framework
by: Stojkovic, Jovan, et al.
Published: (2025)
DABench-LLM: Standardized and In-Depth Benchmarking of Post-Moore Dataflow AI Accelerators for LLMs
by: Hu, Ziyu, et al.
Published: (2025)
HLS4PC: A Parametrizable Framework For Accelerating Point-Based 3D Point Cloud Models on FPGA
by: Pal, Amur Saqib, et al.
Published: (2025)
Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures
by: Vellaisamy, Prabhu, et al.
Published: (2025)
Cache Your Prompt When It's Green: Carbon-Aware Caching for Large Language Model Serving
by: Tian, Yuyang, et al.
Published: (2025)
Heterogeneous Computing: The Key to Powering the Future of AI Agent Inference
by: Zhao, Yiren, et al.
Published: (2026)
Sustainable Supercomputing for AI: GPU Power Capping at HPC Scale
by: Zhao, Dan, et al.
Published: (2024)
Modernizing Amdahl's Law: How AI Scaling Laws Shape Computer Architecture
by: Lu, Chien-Ping
Published: (2026)
Taming Asynchronous CPU-GPU Coupling for Frequency-aware Latency Estimation on Mobile Edge
by: Chen, Jiesong, et al.
Published: (2026)
Co-design of a novel CMOS highly parallel, low-power, multi-chip neural network accelerator
by: Hokenmaier, W, et al.
Published: (2024)
TriMoE: Augmenting GPU with AMX-Enabled CPU and DIMM-NDP for High-Throughput MoE Inference via Offloading
by: Pan, Yudong, et al.
Published: (2026)
Exploring energy consumption of AI frameworks on a 64-core RV64 Server CPU
by: Malenza, Giulio, et al.
Published: (2025)
Good things come in small packages: Should we build AI clusters with Lite-GPUs?
by: Canakci, Burcu, et al.
Published: (2025)
Power Stabilization for AI Training Datacenters
by: Choukse, Esha, et al.
Published: (2025)
Strict Partitioning for Sporadic Rigid Gang Tasks
by: Sun, Binqi, et al.
Published: (2024)
ODIN-Based CPU-GPU Architecture with Replay-Driven Simulation and Emulation
by: Dorairaj, Nij, et al.
Published: (2026)