| Main Authors: | Pang, Bowen, Li, Kai, She, Ruifeng, Wang, Feifan |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2502.15763 |
Similar Items
Serving Large Language Models on Huawei CloudMatrix384
by: Zuo, Pengfei, et al.
Published: (2025)
Advancing AI-assisted Hardware Design with Hierarchical Decentralized Training and Personalized Inference-Time Optimization
by: Chen, Hao Mark, et al.
Published: (2025)
Online GPU Energy Optimization with Switching-Aware Bandits
by: Xu, Xiongxiao, et al.
Published: (2024)
Patterns behind Chaos: Forecasting Data Movement for Efficient Large-Scale MoE LLM Inference
by: Yu, Zhongkai, et al.
Published: (2025)
Llumnix: Dynamic Scheduling for Large Language Model Serving
by: Sun, Biao, et al.
Published: (2024)
WaferLLM: Large Language Model Inference at Wafer Scale
by: He, Congjie, et al.
Published: (2025)
PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System
by: He, Yintao, et al.
Published: (2025)
ZettaLith: An Architectural Exploration of Extreme-Scale AI Inference Acceleration
by: Silverbrook, Kia
Published: (2025)
SCAR: Scheduling Multi-Model AI Workloads on Heterogeneous Multi-Chiplet Module Accelerators
by: Odema, Mohanad, et al.
Published: (2024)
SLO-aware GPU Frequency Scaling for Energy Efficient LLM Inference Serving
by: Kakolyris, Andreas Kosmas, et al.
Published: (2024)
MIST: A Co-Design Framework for Heterogeneous, Multi-Stage LLM Inference
by: Bambhaniya, Abhimanyu Rajeshkumar, et al.
Published: (2025)
Demystifying AI Platform Design for Distributed Inference of Next-Generation LLM models
by: Bambhaniya, Abhimanyu, et al.
Published: (2024)
PREBA: A Hardware/Software Co-Design for Multi-Instance GPU based AI Inference Servers
by: Yeo, Gwangoo, et al.
Published: (2024)
Performance Modeling and Workload Analysis of Distributed Large Language Model Training and Inference
by: Kundu, Joyjit, et al.
Published: (2024)
ELMoE-3D: Leveraging Intrinsic Elasticity of MoE for Hybrid-Bonding-Enabled Self-Speculative Decoding in On-Premises Serving
by: Choi, Yuseon, et al.
Published: (2026)
NPU Design for Diffusion Language Model Inference
by: Lou, Binglei, et al.
Published: (2026)
PIM-Opt: Demystifying Distributed Optimization Algorithms on a Real-World Processing-In-Memory System
by: Rhyner, Steve, et al.
Published: (2024)
ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs
by: Yang, Jinwu, et al.
Published: (2026)
Enabling Accelerators for Graph Computing
by: Shivdikar, Kaustubh
Published: (2023)
Accelerating Sampling and Aggregation Operations in GNN Frameworks with GPU Initiated Direct Storage Accesses
by: Park, Jeongmin Brian, et al.
Published: (2023)
Sustainable AI Training via Hardware-Software Co-Design on NVIDIA, AMD, and Emerging GPU Architectures
by: Makin, Yashasvi, et al.
Published: (2025)
AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving
by: Yu, Zhongkai, et al.
Published: (2026)
Splitwiser: Efficient LM inference with constrained resources
by: Aali, Asad, et al.
Published: (2025)
CUTEv2: Unified and Configurable Matrix Extension for Diverse CPU Architectures with Minimal Design Overhead
by: Ye, Jinpeng, et al.
Published: (2026)
Sensitivity-Guided Framework for Pruned and Quantized Reservoir Computing Accelerators
by: Jafari, Atousa, et al.
Published: (2026)
FlexLink: Boosting your NVLink Bandwidth by 27% without accuracy concern
by: Shen, Ao, et al.
Published: (2025)
VLSI Hypergraph Partitioning with Deep Learning
by: Khan, Muhammad Hadir, et al.
Published: (2024)
HyperOffload: Graph-Driven Hierarchical Memory Management for Large Language Models on SuperNode Architectures
by: Liu, Fangxin, et al.
Published: (2026)
Observation, Not Prediction: Conversation-Level Disaggregated Scheduling for Agentic Serving
by: Ding, Jianru, et al.
Published: (2026)
Co-design of a novel CMOS highly parallel, low-power, multi-chip neural network accelerator
by: Hokenmaier, W, et al.
Published: (2024)
Heterogeneous Computing: The Key to Powering the Future of AI Agent Inference
by: Zhao, Yiren, et al.
Published: (2026)
Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures
by: Vellaisamy, Prabhu, et al.
Published: (2025)
DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency
by: Stojkovic, Jovan, et al.
Published: (2024)
Revisiting Disaggregated Large Language Model Serving for Performance and Energy Implications
by: Li, Jiaxi, et al.
Published: (2025)
Performance and Power: Systematic Evaluation of AI Workloads on Accelerators with CARAML
by: John, Chelsea Maria, et al.
Published: (2024)
The Energy Blind Spot: NVIDIA's Flagship Edge AI Hardware Cannot Support Process-Level Energy Attribution
by: Panigrahy, Deepak, et al.
Published: (2026)
Systematic Characterization of LLM Quantization: A Performance, Energy, and Quality Perspective
by: Shi, Tianyao, et al.
Published: (2025)
Flex-TPU: A Flexible TPU with Runtime Reconfigurable Dataflow Architecture
by: Elbtity, Mohammed, et al.
Published: (2024)
Deep Reinforcement Learning based Online Scheduling Policy for Deep Neural Network Multi-Tenant Multi-Accelerator Systems
by: Blanco, Francesco G., et al.
Published: (2024)
Towards Fair and Firm Real-Time Scheduling in DNN Multi-Tenant Multi-Accelerator Systems via Reinforcement Learning
by: Russo, Enrico, et al.
Published: (2024)