:: Library Catalog

Bibliographic Details
Main Authors:	Zhu, Wenbin; Shen, Zhaoyan; Shao, Zili; Dai, Hongjun; Chen, Feng
Format:	Preprint
Published:	2025
Subjects:	Distributed, Parallel, and Cluster Computing; Artificial Intelligence; Hardware Architecture
Online Access:	https://arxiv.org/abs/2512.01357

Similar Items

Sparse MTTKRP Acceleration for Tensor Decomposition on GPU
by: Wijeratne, Sasindu, et al.
Published: (2024)

Adaptive KV Cache Reuse for Fast Long-Context LLM Serving
by: Li, Fei, et al.
Published: (2026)

Application Experiences on a GPU-Accelerated Arm-based HPC Testbed
by: Elwasif, Wael, et al.
Published: (2022)

Characterizing CPU-Induced Slowdowns in Multi-GPU LLM Inference
by: Chung, Euijun, et al.
Published: (2026)

An Evaluation and Comparison of GPU Hardware and Solver Libraries for Accelerating the OPM Flow Reservoir Simulator
by: Qiu, Tong Dong, et al.
Published: (2023)

Accelerating Triangle Counting with Real Processing-in-Memory Systems
by: Asquini, Lorenzo, et al.
Published: (2025)

Balanced Data Placement for GEMV Acceleration with Processing-In-Memory
by: Ibrahim, Mohamed Assem, et al.
Published: (2024)

MoE-Hub: Taming Software Complexity for Seamless MoE Overlap with Hardware-Accelerated Communication on Multi-GPU Systems
by: Zhou, Zhuoshan, et al.
Published: (2026)

Chopper: A Multi-Level GPU Characterization Tool & Derived Insights Into LLM Training Inefficiency
by: Kurzynski, Marco, et al.
Published: (2025)

CMDS: Cross-layer Dataflow Optimization for DNN Accelerators Exploiting Multi-bank Memories
by: Shi, Man, et al.
Published: (2024)

CELLO: Co-designing Schedule and Hybrid Implicit/Explicit Buffer for Complex Tensor Reuse
by: Garg, Raveesh, et al.
Published: (2023)

Accelerating Recommender Model ETL with a Streaming FPGA-GPU Dataflow
by: Zhu, Yu, et al.
Published: (2025)

MVDRAM: Enabling GeMV Execution in Unmodified DRAM for Low-Bit LLM Acceleration
by: Kubo, Tatsuya, et al.
Published: (2025)

PAM: Processing Across Memory Hierarchy for Efficient KV-centric LLM Serving System
by: Liu, Lian, et al.
Published: (2026)

HetGPU: The pursuit of making binary compatibility towards GPUs
by: Yang, Yiwei, et al.
Published: (2025)

Optimizing ML Concurrent Computation and Communication with GPU DMA Engines
by: Agrawal, Anirudha, et al.
Published: (2024)

Towards Compute-Aware In-Switch Computing for LLMs Tensor-Parallelism on Multi-GPU Systems
by: Zhang, Chen, et al.
Published: (2026)

Design in Tiles: Automating GEMM Deployment on Tile-Based Many-PE Accelerators
by: Shen, Aofeng, et al.
Published: (2025)

Microbenchmark-Driven Analytical Performance Modeling Across Modern GPU Architectures
by: Jarmusch, Aaron, et al.
Published: (2026)

Efficient Architecture for RISC-V Vector Memory Access
by: Guan, Hongyi, et al.
Published: (2025)

Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods
by: Fatima, Amel, et al.
Published: (2026)

FinGraV: Methodology for Fine-Grain GPU Power Visibility and Insights
by: Singhania, Varsha, et al.
Published: (2024)

Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
by: Lin, Bin, et al.
Published: (2024)

TeraPool: A Physical Design Aware, 1024 RISC-V Cores Shared-L1-Memory Scaled-up Cluster Design with High Bandwidth Main Memory Link
by: Zhang, Yichao, et al.
Published: (2026)

Memory-Centric Computing: Solving Computing's Memory Problem
by: Mutlu, Onur, et al.
Published: (2025)

FastGL: A GPU-Efficient Framework for Accelerating Sampling-Based GNN Training at Large Scale
by: Zhu, Zeyu, et al.
Published: (2024)

Scheduling Techniques of AI Models on Modern Heterogeneous Edge GPU -- A Critical Review
by: Majeed, Ashiyana Abdul, et al.
Published: (2025)

Improving Multi-Instance GPU Efficiency via Sub-Entry Sharing TLB Design
by: Li, Bingyao, et al.
Published: (2024)

Fine-Grained Power and Energy Attribution on AMD GPU/APU-Based Exascale Nodes
by: McDaniel, Adam, et al.
Published: (2026)

Multi-Partner Project: Multi-GPU Performance Portability Analysis for CFD Simulations at Scale
by: Eleftherakis, Panagiotis-Eleftherios, et al.
Published: (2026)

DeFiNES: Enabling Fast Exploration of the Depth-first Scheduling Space for DNN Accelerators through Analytical Modeling
by: Mei, Linyan, et al.
Published: (2022)

CCSS: Hardware-Accelerated RTL Simulation with Fast Combinational Logic Computing and Sequential Logic Synchronization
by: Feng, Weigang, et al.
Published: (2025)

GigaAPI for GPU Parallelization
by: Suvarna, M., et al.
Published: (2025)

DCO: Dynamic Cache Orchestration for LLM Accelerators through Predictive Management
by: Zhou, Zhongchun, et al.
Published: (2025)

Analyzing a Two-Tier Disaggregated Memory Protection Scheme Based on Memory Replication
by: Volos, Haris, et al.
Published: (2025)

SwarmIO: Towards 100 Million IOPS SSD Emulation for Next-generation GPU-centric Storage Systems
by: Kim, Hyeseong, et al.
Published: (2026)

Evaluation of computational and energy performance in matrix multiplication algorithms on CPU and GPU using MKL, cuBLAS and SYCL
by: Torres, L. A., et al.
Published: (2024)

Pooling Engram Conditional Memory in Large Language Models using CXL
by: Ma, Ruiyang, et al.
Published: (2026)

EDEA: Efficient Dual-Engine Accelerator for Depthwise Separable Convolution with Direct Data Transfer
by: Chen, Yi, et al.
Published: (2025)

Accelerating MoE with Dynamic In-Switch Computing on Multi-GPUs
by: Zhang, Qijun, et al.
Published: (2026)