:: Library Catalog

Bibliographic Details
Main Authors:	Pang, Bowen; Li, Kai; She, Ruifeng; Wang, Feifan
Format:	Preprint
Published:	2025
Subjects:	Distributed, Parallel, and Cluster Computing; Artificial Intelligence; Hardware Architecture; Machine Learning
Online Access:	https://arxiv.org/abs/2502.15763

Similar Items

Serving Large Language Models on Huawei CloudMatrix384
by: Zuo, Pengfei, et al.
Published: (2025)

Advancing AI-assisted Hardware Design with Hierarchical Decentralized Training and Personalized Inference-Time Optimization
by: Chen, Hao Mark, et al.
Published: (2025)

Online GPU Energy Optimization with Switching-Aware Bandits
by: Xu, Xiongxiao, et al.
Published: (2024)

Patterns behind Chaos: Forecasting Data Movement for Efficient Large-Scale MoE LLM Inference
by: Yu, Zhongkai, et al.
Published: (2025)

Llumnix: Dynamic Scheduling for Large Language Model Serving
by: Sun, Biao, et al.
Published: (2024)

WaferLLM: Large Language Model Inference at Wafer Scale
by: He, Congjie, et al.
Published: (2025)

PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System
by: He, Yintao, et al.
Published: (2025)

ZettaLith: An Architectural Exploration of Extreme-Scale AI Inference Acceleration
by: Silverbrook, Kia
Published: (2025)

SCAR: Scheduling Multi-Model AI Workloads on Heterogeneous Multi-Chiplet Module Accelerators
by: Odema, Mohanad, et al.
Published: (2024)

SLO-aware GPU Frequency Scaling for Energy Efficient LLM Inference Serving
by: Kakolyris, Andreas Kosmas, et al.
Published: (2024)

MIST: A Co-Design Framework for Heterogeneous, Multi-Stage LLM Inference
by: Bambhaniya, Abhimanyu Rajeshkumar, et al.
Published: (2025)

Demystifying AI Platform Design for Distributed Inference of Next-Generation LLM models
by: Bambhaniya, Abhimanyu, et al.
Published: (2024)

PREBA: A Hardware/Software Co-Design for Multi-Instance GPU based AI Inference Servers
by: Yeo, Gwangoo, et al.
Published: (2024)

Performance Modeling and Workload Analysis of Distributed Large Language Model Training and Inference
by: Kundu, Joyjit, et al.
Published: (2024)

ELMoE-3D: Leveraging Intrinsic Elasticity of MoE for Hybrid-Bonding-Enabled Self-Speculative Decoding in On-Premises Serving
by: Choi, Yuseon, et al.
Published: (2026)

NPU Design for Diffusion Language Model Inference
by: Lou, Binglei, et al.
Published: (2026)

PIM-Opt: Demystifying Distributed Optimization Algorithms on a Real-World Processing-In-Memory System
by: Rhyner, Steve, et al.
Published: (2024)

ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs
by: Yang, Jinwu, et al.
Published: (2026)

Enabling Accelerators for Graph Computing
by: Shivdikar, Kaustubh
Published: (2023)

Accelerating Sampling and Aggregation Operations in GNN Frameworks with GPU Initiated Direct Storage Accesses
by: Park, Jeongmin Brian, et al.
Published: (2023)

Sustainable AI Training via Hardware-Software Co-Design on NVIDIA, AMD, and Emerging GPU Architectures
by: Makin, Yashasvi, et al.
Published: (2025)

AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving
by: Yu, Zhongkai, et al.
Published: (2026)

Splitwiser: Efficient LM inference with constrained resources
by: Aali, Asad, et al.
Published: (2025)

CUTEv2: Unified and Configurable Matrix Extension for Diverse CPU Architectures with Minimal Design Overhead
by: Ye, Jinpeng, et al.
Published: (2026)

Sensitivity-Guided Framework for Pruned and Quantized Reservoir Computing Accelerators
by: Jafari, Atousa, et al.
Published: (2026)

FlexLink: Boosting your NVLink Bandwidth by 27% without accuracy concern
by: Shen, Ao, et al.
Published: (2025)

VLSI Hypergraph Partitioning with Deep Learning
by: Khan, Muhammad Hadir, et al.
Published: (2024)

HyperOffload: Graph-Driven Hierarchical Memory Management for Large Language Models on SuperNode Architectures
by: Liu, Fangxin, et al.
Published: (2026)

Observation, Not Prediction: Conversation-Level Disaggregated Scheduling for Agentic Serving
by: Ding, Jianru, et al.
Published: (2026)

Co-design of a novel CMOS highly parallel, low-power, multi-chip neural network accelerator
by: Hokenmaier, W, et al.
Published: (2024)

Heterogeneous Computing: The Key to Powering the Future of AI Agent Inference
by: Zhao, Yiren, et al.
Published: (2026)

Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures
by: Vellaisamy, Prabhu, et al.
Published: (2025)

DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency
by: Stojkovic, Jovan, et al.
Published: (2024)

Revisiting Disaggregated Large Language Model Serving for Performance and Energy Implications
by: Li, Jiaxi, et al.
Published: (2025)

Performance and Power: Systematic Evaluation of AI Workloads on Accelerators with CARAML
by: John, Chelsea Maria, et al.
Published: (2024)

The Energy Blind Spot: NVIDIA's Flagship Edge AI Hardware Cannot Support Process-Level Energy Attribution
by: Panigrahy, Deepak, et al.
Published: (2026)

Systematic Characterization of LLM Quantization: A Performance, Energy, and Quality Perspective
by: Shi, Tianyao, et al.
Published: (2025)

Flex-TPU: A Flexible TPU with Runtime Reconfigurable Dataflow Architecture
by: Elbtity, Mohammed, et al.
Published: (2024)

Deep Reinforcement Learning based Online Scheduling Policy for Deep Neural Network Multi-Tenant Multi-Accelerator Systems
by: Blanco, Francesco G., et al.
Published: (2024)

Towards Fair and Firm Real-Time Scheduling in DNN Multi-Tenant Multi-Accelerator Systems via Reinforcement Learning
by: Russo, Enrico, et al.
Published: (2024)