| Main Authors: | Uchino, Yuki, Ma, Qianxiang, Imamura, Toshiyuki, Ozaki, Katsuhisa, Gutsche, Patrick Lars |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2512.08321 |
Similar Titles
High-Performance and Power-Efficient Emulation of Matrix Multiplication using INT8 Matrix Engines
by: Uchino, Yuki, et al.
Published: (2025)
Double-Precision Matrix Multiplication Emulation via Ozaki-II Scheme with FP8 Quantization
by: Uchino, Yuki, et al.
Published: (2026)
Error Analysis of Matrix Multiplication Emulation Using Ozaki-II Scheme
by: Uchino, Yuki, et al.
Published: (2026)
Performance Enhancement of the Ozaki Scheme on Integer Matrix Multiplication Unit
by: Uchino, Yuki, et al.
Published: (2024)
DGEMM on Integer Matrix Multiplication Unit
by: Ootomo, Hiroyuki, et al.
Published: (2023)
An inherently parallel H2-ULV factorization for solving dense linear systems on GPUs
by: Ma, Qianxiang, et al.
Published: (2025)
Fast Kronecker Matrix-Matrix Multiplication on GPUs
by: Jangda, Abhinav, et al.
Published: (2024)
Sparsity-Aware Roofline Models for Sparse Matrix-Matrix Multiplication
by: Qian, Matthew, et al.
Published: (2026)
LOw-cOst yet High-Performant Sparse Matrix-Matrix Multiplication on Arm SME Architectures
by: Lei, Kelun, et al.
Published: (2025)
Accelerating Sparse Matrix-Matrix Multiplication on GPUs with Processing Near HBMs
by: Li, Shiju, et al.
Published: (2025)
MAGNUS: Generating Data Locality to Accelerate Sparse Matrix-Matrix Multiplication on CPUs
by: Wolfson-Pou, Jordi, et al.
Published: (2025)
AsyncSparse: Accelerating Sparse Matrix-Matrix Multiplication on Asynchronous GPU Architectures
by: Liu, Jie, et al.
Published: (2026)
Improving Locality in Sparse and Dense Matrix Multiplications
by: Dezfuli, Mohammad Mahdi Salehi, et al.
Published: (2024)
ParamSpMM: Adaptive and Efficient Sparse Matrix-Matrix Multiplication on GPUs for GNNs
by: Zhang, Lixing, et al.
Published: (2026)
Hello SME! Generating Fast Matrix Multiplication Kernels Using the Scalable Matrix Extension
by: Remke, Stefan, et al.
Published: (2024)
Distributed-Memory Parallel Algorithms for Sparse Matrix and Sparse Tall-and-Skinny Matrix Multiplication
by: Ranawaka, Isuru, et al.
Published: (2024)
Demystifying ARM SME to Optimize General Matrix Multiplications
by: Deng, Chencheng, et al.
Published: (2025)
RDMA-Based Algorithms for Sparse Matrix Multiplication on GPUs
by: Brock, Benjamin, et al.
Published: (2023)
Analysis of the Performance of the Matrix Multiplication Algorithm on the Cirrus Supercomputer
by: Adefemi, Temitayo
Published: (2024)
HC-SpMM: Accelerating Sparse Matrix-Matrix Multiplication for Graphs with Hybrid GPU Cores
by: Li, Zhonggen, et al.
Published: (2024)
RSH-SpMM: A Row-Structured Hybrid Kernel for Sparse Matrix-Matrix Multiplication on GPUs
by: Li, Aiying, et al.
Published: (2026)
Exploring Sparse Matrix Multiplication Kernels on the Cerebras CS-3
by: Shah, Milan, et al.
Published: (2026)
Is Sparse Matrix Reordering Effective for Sparse Matrix-Vector Multiplication?
by: Asudeh, Omid, et al.
Published: (2025)
FalconGEMM: Surpassing Hardware Peaks with Lower-Complexity Matrix Multiplication
by: Zhu, Honglin, et al.
Published: (2026)
Efficiently Parallelizable Strassen-Based Multiplication of a Matrix by its Transpose
by: Arrigoni, Viviana, et al.
Published: (2021)
Matrix Multiplication in the MPC Model
by: Joshi, Lakshya, et al.
Published: (2025)
Stencil Matrixization
by: Zhao, Wenxuan, et al.
Published: (2023)
Tensor-Parallel Emulation of Quantum Circuits with Block-Cyclic Distributed Matrix Product States
by: Adamski, Jakub, et al.
Published: (2025)
BouquetFL: Emulating diverse participant hardware in Federated Learning
by: Geimer, Arno
Published: (2026)
Mapping Parallel Matrix Multiplication in GotoBLAS2 to the AMD Versal ACAP for Deep Learning
by: Lei, Jie, et al.
Published: (2024)
Emulating a computing grid in a local environment for feature evaluation
by: Kalawana, Jananga, et al.
Published: (2024)
Scaled Block Vecchia Approximation for High-Dimensional Gaussian Process Emulation on GPUs
by: Pan, Qilong, et al.
Published: (2025)
PARS3: Parallel Sparse Skew-Symmetric Matrix-Vector Multiplication with Reverse Cuthill-McKee Reordering
by: Yildirim, Selin, et al.
Published: (2024)
Selection of Supervised Learning-based Sparse Matrix Reordering Algorithms
by: Tang, Tao, et al.
Published: (2025)
BlockEmulator: An Emulator Enabling to Test Blockchain Sharding Protocols
by: Huang, Huawei, et al.
Published: (2023)
LLM-Emu: Native Runtime Emulation of LLM Inference via Profile-Driven Sampling
by: Da, Wei, et al.
Published: (2026)
SHIRO: Near-Optimal Communication Strategies for Distributed Sparse Matrix Multiplication
by: Zhuang, Chen, et al.
Published: (2025)
Closing a Source Complexity Gap between Chapel and HPX
by: Atre, Shreyas, et al.
Published: (2025)
Ocean: Fast Estimation-Based Sparse General Matrix-Matrix Multiplication on GPU
by: Li, Yifan, et al.
Published: (2026)
W4A16 Mixed-Precision Matrix Multiplication on Decoupled Architecture: Kernel Design and Memory Bottleneck Analysis for Ascend NPUs
by: He, Yuanhong, et al.
Published: (2026)