Library Catalog

Bibliographic Details
Main Authors:	Uchino, Yuki; Ma, Qianxiang; Imamura, Toshiyuki; Ozaki, Katsuhisa; Gutsche, Patrick Lars
Format:	Preprint
Published:	2025
Subjects:	Distributed, Parallel, and Cluster Computing
Online Access:	https://arxiv.org/abs/2512.08321

Similar Titles

High-Performance and Power-Efficient Emulation of Matrix Multiplication using INT8 Matrix Engines
by: Uchino, Yuki, et al.
Published: (2025)

Double-Precision Matrix Multiplication Emulation via Ozaki-II Scheme with FP8 Quantization
by: Uchino, Yuki, et al.
Published: (2026)

Error Analysis of Matrix Multiplication Emulation Using Ozaki-II Scheme
by: Uchino, Yuki, et al.
Published: (2026)

Performance Enhancement of the Ozaki Scheme on Integer Matrix Multiplication Unit
by: Uchino, Yuki, et al.
Published: (2024)

DGEMM on Integer Matrix Multiplication Unit
by: Ootomo, Hiroyuki, et al.
Published: (2023)

An inherently parallel H2-ULV factorization for solving dense linear systems on GPUs
by: Ma, Qianxiang, et al.
Published: (2025)

Fast Kronecker Matrix-Matrix Multiplication on GPUs
by: Jangda, Abhinav, et al.
Published: (2024)

Sparsity-Aware Roofline Models for Sparse Matrix-Matrix Multiplication
by: Qian, Matthew, et al.
Published: (2026)

LOw-cOst yet High-Performant Sparse Matrix-Matrix Multiplication on Arm SME Architectures
by: Lei, Kelun, et al.
Published: (2025)

Accelerating Sparse Matrix-Matrix Multiplication on GPUs with Processing Near HBMs
by: Li, Shiju, et al.
Published: (2025)

MAGNUS: Generating Data Locality to Accelerate Sparse Matrix-Matrix Multiplication on CPUs
by: Wolfson-Pou, Jordi, et al.
Published: (2025)

AsyncSparse: Accelerating Sparse Matrix-Matrix Multiplication on Asynchronous GPU Architectures
by: Liu, Jie, et al.
Published: (2026)

Improving Locality in Sparse and Dense Matrix Multiplications
by: Dezfuli, Mohammad Mahdi Salehi, et al.
Published: (2024)

ParamSpMM: Adaptive and Efficient Sparse Matrix-Matrix Multiplication on GPUs for GNNs
by: Zhang, Lixing, et al.
Published: (2026)

Hello SME! Generating Fast Matrix Multiplication Kernels Using the Scalable Matrix Extension
by: Remke, Stefan, et al.
Published: (2024)

Distributed-Memory Parallel Algorithms for Sparse Matrix and Sparse Tall-and-Skinny Matrix Multiplication
by: Ranawaka, Isuru, et al.
Published: (2024)

Demystifying ARM SME to Optimize General Matrix Multiplications
by: Deng, Chencheng, et al.
Published: (2025)

RDMA-Based Algorithms for Sparse Matrix Multiplication on GPUs
by: Brock, Benjamin, et al.
Published: (2023)

Analysis of the Performance of the Matrix Multiplication Algorithm on the Cirrus Supercomputer
by: Adefemi, Temitayo
Published: (2024)

HC-SpMM: Accelerating Sparse Matrix-Matrix Multiplication for Graphs with Hybrid GPU Cores
by: Li, Zhonggen, et al.
Published: (2024)

RSH-SpMM: A Row-Structured Hybrid Kernel for Sparse Matrix-Matrix Multiplication on GPUs
by: Li, Aiying, et al.
Published: (2026)

Exploring Sparse Matrix Multiplication Kernels on the Cerebras CS-3
by: Shah, Milan, et al.
Published: (2026)

Is Sparse Matrix Reordering Effective for Sparse Matrix-Vector Multiplication?
by: Asudeh, Omid, et al.
Published: (2025)

FalconGEMM: Surpassing Hardware Peaks with Lower-Complexity Matrix Multiplication
by: Zhu, Honglin, et al.
Published: (2026)

Efficiently Parallelizable Strassen-Based Multiplication of a Matrix by its Transpose
by: Arrigoni, Viviana, et al.
Published: (2021)

Matrix Multiplication in the MPC Model
by: Joshi, Lakshya, et al.
Published: (2025)

Stencil Matrixization
by: Zhao, Wenxuan, et al.
Published: (2023)

Tensor-Parallel Emulation of Quantum Circuits with Block-Cyclic Distributed Matrix Product States
by: Adamski, Jakub, et al.
Published: (2025)

BouquetFL: Emulating diverse participant hardware in Federated Learning
by: Geimer, Arno
Published: (2026)

Mapping Parallel Matrix Multiplication in GotoBLAS2 to the AMD Versal ACAP for Deep Learning
by: Lei, Jie, et al.
Published: (2024)

Emulating a computing grid in a local environment for feature evaluation
by: Kalawana, Jananga, et al.
Published: (2024)

Scaled Block Vecchia Approximation for High-Dimensional Gaussian Process Emulation on GPUs
by: Pan, Qilong, et al.
Published: (2025)

PARS3: Parallel Sparse Skew-Symmetric Matrix-Vector Multiplication with Reverse Cuthill-McKee Reordering
by: Yildirim, Selin, et al.
Published: (2024)

Selection of Supervised Learning-based Sparse Matrix Reordering Algorithms
by: Tang, Tao, et al.
Published: (2025)

BlockEmulator: An Emulator Enabling to Test Blockchain Sharding Protocols
by: Huang, Huawei, et al.
Published: (2023)

LLM-Emu: Native Runtime Emulation of LLM Inference via Profile-Driven Sampling
by: Da, Wei, et al.
Published: (2026)

SHIRO: Near-Optimal Communication Strategies for Distributed Sparse Matrix Multiplication
by: Zhuang, Chen, et al.
Published: (2025)

Closing a Source Complexity Gap between Chapel and HPX
by: Atre, Shreyas, et al.
Published: (2025)

Ocean: Fast Estimation-Based Sparse General Matrix-Matrix Multiplication on GPU
by: Li, Yifan, et al.
Published: (2026)

W4A16 Mixed-Precision Matrix Multiplication on Decoupled Architecture: Kernel Design and Memory Bottleneck Analysis for Ascend NPUs
by: He, Yuanhong, et al.
Published: (2026)