| Main Authors: | Zhu, Wenbin, Shen, Zhaoyan, Shao, Zili, Dai, Hongjun, Chen, Feng |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2512.01357 |
Similar Items
Sparse MTTKRP Acceleration for Tensor Decomposition on GPU
by: Wijeratne, Sasindu, et al.
Published: (2024)
Adaptive KV Cache Reuse for Fast Long-Context LLM Serving
by: Li, Fei, et al.
Published: (2026)
Application Experiences on a GPU-Accelerated Arm-based HPC Testbed
by: Elwasif, Wael, et al.
Published: (2022)
Characterizing CPU-Induced Slowdowns in Multi-GPU LLM Inference
by: Chung, Euijun, et al.
Published: (2026)
An Evaluation and Comparison of GPU Hardware and Solver Libraries for Accelerating the OPM Flow Reservoir Simulator
by: Qiu, Tong Dong, et al.
Published: (2023)
Accelerating Triangle Counting with Real Processing-in-Memory Systems
by: Asquini, Lorenzo, et al.
Published: (2025)
Balanced Data Placement for GEMV Acceleration with Processing-In-Memory
by: Ibrahim, Mohamed Assem, et al.
Published: (2024)
MoE-Hub: Taming Software Complexity for Seamless MoE Overlap with Hardware-Accelerated Communication on Multi-GPU Systems
by: Zhou, Zhuoshan, et al.
Published: (2026)
Chopper: A Multi-Level GPU Characterization Tool & Derived Insights Into LLM Training Inefficiency
by: Kurzynski, Marco, et al.
Published: (2025)
CMDS: Cross-layer Dataflow Optimization for DNN Accelerators Exploiting Multi-bank Memories
by: Shi, Man, et al.
Published: (2024)
CELLO: Co-designing Schedule and Hybrid Implicit/Explicit Buffer for Complex Tensor Reuse
by: Garg, Raveesh, et al.
Published: (2023)
Accelerating Recommender Model ETL with a Streaming FPGA-GPU Dataflow
by: Zhu, Yu, et al.
Published: (2025)
MVDRAM: Enabling GeMV Execution in Unmodified DRAM for Low-Bit LLM Acceleration
by: Kubo, Tatsuya, et al.
Published: (2025)
PAM: Processing Across Memory Hierarchy for Efficient KV-centric LLM Serving System
by: Liu, Lian, et al.
Published: (2026)
HetGPU: The pursuit of making binary compatibility towards GPUs
by: Yang, Yiwei, et al.
Published: (2025)
Optimizing ML Concurrent Computation and Communication with GPU DMA Engines
by: Agrawal, Anirudha, et al.
Published: (2024)
Towards Compute-Aware In-Switch Computing for LLMs Tensor-Parallelism on Multi-GPU Systems
by: Zhang, Chen, et al.
Published: (2026)
Design in Tiles: Automating GEMM Deployment on Tile-Based Many-PE Accelerators
by: Shen, Aofeng, et al.
Published: (2025)
Microbenchmark-Driven Analytical Performance Modeling Across Modern GPU Architectures
by: Jarmusch, Aaron, et al.
Published: (2026)
Efficient Architecture for RISC-V Vector Memory Access
by: Guan, Hongyi, et al.
Published: (2025)
Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods
by: Fatima, Amel, et al.
Published: (2026)
FinGraV: Methodology for Fine-Grain GPU Power Visibility and Insights
by: Singhania, Varsha, et al.
Published: (2024)
Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
by: Lin, Bin, et al.
Published: (2024)
TeraPool: A Physical Design Aware, 1024 RISC-V Cores Shared-L1-Memory Scaled-up Cluster Design with High Bandwidth Main Memory Link
by: Zhang, Yichao, et al.
Published: (2026)
Memory-Centric Computing: Solving Computing's Memory Problem
by: Mutlu, Onur, et al.
Published: (2025)
FastGL: A GPU-Efficient Framework for Accelerating Sampling-Based GNN Training at Large Scale
by: Zhu, Zeyu, et al.
Published: (2024)
Scheduling Techniques of AI Models on Modern Heterogeneous Edge GPU -- A Critical Review
by: Majeed, Ashiyana Abdul, et al.
Published: (2025)
Improving Multi-Instance GPU Efficiency via Sub-Entry Sharing TLB Design
by: Li, Bingyao, et al.
Published: (2024)
Fine-Grained Power and Energy Attribution on AMD GPU/APU-Based Exascale Nodes
by: McDaniel, Adam, et al.
Published: (2026)
Multi-Partner Project: Multi-GPU Performance Portability Analysis for CFD Simulations at Scale
by: Eleftherakis, Panagiotis-Eleftherios, et al.
Published: (2026)
DeFiNES: Enabling Fast Exploration of the Depth-first Scheduling Space for DNN Accelerators through Analytical Modeling
by: Mei, Linyan, et al.
Published: (2022)
CCSS: Hardware-Accelerated RTL Simulation with Fast Combinational Logic Computing and Sequential Logic Synchronization
by: Feng, Weigang, et al.
Published: (2025)
GigaAPI for GPU Parallelization
by: Suvarna, M., et al.
Published: (2025)
DCO: Dynamic Cache Orchestration for LLM Accelerators through Predictive Management
by: Zhou, Zhongchun, et al.
Published: (2025)
Analyzing a Two-Tier Disaggregated Memory Protection Scheme Based on Memory Replication
by: Volos, Haris, et al.
Published: (2025)
SwarmIO: Towards 100 Million IOPS SSD Emulation for Next-generation GPU-centric Storage Systems
by: Kim, Hyeseong, et al.
Published: (2026)
Evaluation of computational and energy performance in matrix multiplication algorithms on CPU and GPU using MKL, cuBLAS and SYCL
by: Torres, L. A., et al.
Published: (2024)
Pooling Engram Conditional Memory in Large Language Models using CXL
by: Ma, Ruiyang, et al.
Published: (2026)
EDEA: Efficient Dual-Engine Accelerator for Depthwise Separable Convolution with Direct Data Transfer
by: Chen, Yi, et al.
Published: (2025)
Accelerating MoE with Dynamic In-Switch Computing on Multi-GPUs
by: Zhang, Qijun, et al.
Published: (2026)