:: Library Catalog

Bibliographic Details
Main Authors:	Zhou, Zhongchun; Lai, Chengtao; Gu, Yuhang; Zhang, Wei
Format:	Preprint
Published:	2025
Subjects:	Hardware Architecture; Artificial Intelligence; Distributed, Parallel, and Cluster Computing
Online Access:	https://arxiv.org/abs/2512.07312

Similar Items

LLaMCAT: Optimizing Large Language Model Inference with Cache Arbitration and Throttling
by: Zhou, Zhongchun, et al.
Published: (2025)

PiKV: KV Cache Management System for Mixture of Experts
by: Liu, Dong, et al.
Published: (2025)

Tangram: Accelerating Serverless LLM Loading through GPU Memory Reuse and Affinity
by: Zhu, Wenbin, et al.
Published: (2025)

PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving
by: Yüzügüler, Ahmet Caner, et al.
Published: (2025)

The DMA Streaming Framework: Kernel-Level Buffer Orchestration for High-Performance AI Data Paths
by: Graziano, Marco
Published: (2026)

DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency
by: Stojkovic, Jovan, et al.
Published: (2024)

ARKV: Adaptive and Resource-Efficient KV Cache Management under Limited Memory Budget for Long-Context Inference in LLMs
by: Lei, Jianlong, et al.
Published: (2026)

Investigating Memory Failure Prediction Across CPU Architectures
by: Yu, Qiao, et al.
Published: (2024)

SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators
by: Li, Jonathan, et al.
Published: (2025)

Cloud to Edge: Benchmarking LLM Inference On Hardware-Accelerated Single-Board Computers
by: Renney, Harri, et al.
Published: (2026)

Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
by: Qin, Ruoyu, et al.
Published: (2024)

Efficient Edge AI: Deploying Convolutional Neural Networks on FPGA with the Gemmini Accelerator
by: Peccia, Federico Nicolas, et al.
Published: (2024)

HyperOffload: Graph-Driven Hierarchical Memory Management for Large Language Models on SuperNode Architectures
by: Liu, Fangxin, et al.
Published: (2026)

PhD Thesis Summary: Methods for Reliability Assessment and Enhancement of Deep Neural Network Hardware Accelerators
by: Taheri, Mahdi
Published: (2026)

EdgeReasoning: Characterizing Reasoning LLM Deployment on Edge GPUs
by: Kubwimana, Benjamin, et al.
Published: (2025)

Adaptive Multi-Objective Tiered Storage Configuration for KV Cache in LLM Service
by: Zheng, Xianzhe, et al.
Published: (2026)

RedFuser: An Automatic Operator Fusion Framework for Cascaded Reductions on AI Accelerators
by: Tang, Xinsheng, et al.
Published: (2026)

Improving AI Efficiency in Data Centres by Power Dynamic Response
by: Marinoni, Andrea, et al.
Published: (2025)

A Scalable NorthPole System with End-to-End Vertical Integration for Low-Latency and Energy-Efficient LLM Inference
by: DeBole, Michael V., et al.
Published: (2025)

Intent-Driven Storage Systems: From Low-Level Tuning to High-Level Understanding
by: Bergman, Shai, et al.
Published: (2025)

NPU Design for Diffusion Language Model Inference
by: Lou, Binglei, et al.
Published: (2026)

FengHuang: Next-Generation Memory Orchestration for AI Inferencing
by: Li, Jiamin, et al.
Published: (2025)

Adaptive KV Cache Reuse for Fast Long-Context LLM Serving
by: Li, Fei, et al.
Published: (2026)

Accelerating MoE with Dynamic In-Switch Computing on Multi-GPUs
by: Zhang, Qijun, et al.
Published: (2026)

Rearchitecting Datacenter Lifecycle for AI: A TCO-Driven Framework
by: Stojkovic, Jovan, et al.
Published: (2025)

DABench-LLM: Standardized and In-Depth Benchmarking of Post-Moore Dataflow AI Accelerators for LLMs
by: Hu, Ziyu, et al.
Published: (2025)

HLS4PC: A Parametrizable Framework For Accelerating Point-Based 3D Point Cloud Models on FPGA
by: Pal, Amur Saqib, et al.
Published: (2025)

Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures
by: Vellaisamy, Prabhu, et al.
Published: (2025)

Cache Your Prompt When It's Green: Carbon-Aware Caching for Large Language Model Serving
by: Tian, Yuyang, et al.
Published: (2025)

Heterogeneous Computing: The Key to Powering the Future of AI Agent Inference
by: Zhao, Yiren, et al.
Published: (2026)

Sustainable Supercomputing for AI: GPU Power Capping at HPC Scale
by: Zhao, Dan, et al.
Published: (2024)

Modernizing Amdahl's Law: How AI Scaling Laws Shape Computer Architecture
by: Lu, Chien-Ping
Published: (2026)

Taming Asynchronous CPU-GPU Coupling for Frequency-aware Latency Estimation on Mobile Edge
by: Chen, Jiesong, et al.
Published: (2026)

Co-design of a novel CMOS highly parallel, low-power, multi-chip neural network accelerator
by: Hokenmaier, W, et al.
Published: (2024)

TriMoE: Augmenting GPU with AMX-Enabled CPU and DIMM-NDP for High-Throughput MoE Inference via Offloading
by: Pan, Yudong, et al.
Published: (2026)

Exploring energy consumption of AI frameworks on a 64-core RV64 Server CPU
by: Malenza, Giulio, et al.
Published: (2025)

Good things come in small packages: Should we build AI clusters with Lite-GPUs?
by: Canakci, Burcu, et al.
Published: (2025)

Power Stabilization for AI Training Datacenters
by: Choukse, Esha, et al.
Published: (2025)

Strict Partitioning for Sporadic Rigid Gang Tasks
by: Sun, Binqi, et al.
Published: (2024)

ODIN-Based CPU-GPU Architecture with Replay-Driven Simulation and Emulation
by: Dorairaj, Nij, et al.
Published: (2026)