:: Library Catalog

Bibliographic Details
Main authors:	Gu, Qiqi; Wu, Chenpeng; Shi, Heng; Yao, Jianguo
Format:	Preprint
Published:	2025
Subjects:	Distributed, Parallel, and Cluster Computing
Online access:	https://arxiv.org/abs/2506.22035

Similar items

Do We Need Tensor Cores for Stencil Computations?
by: Gu, Qiqi, et al.
Published: (2026)

Samoyeds: Accelerating MoE Models with Structured Sparsity Leveraging Sparse Tensor Cores
by: Wu, Chenpeng, et al.
Published: (2025)

Stencil Matrixization
by: Zhao, Wenxuan, et al.
Published: (2023)

Stencil Computations on Tenstorrent Wormhole
by: Piarulli, Lorenzo, et al.
Published: (2026)

Evaluation of Programming Models and Performance for Stencil Computation on Current GPU Architectures
by: Shan, Baodi, et al.
Published: (2024)

A Portable Framework for Accelerating Stencil Computations on Modern Node Architectures
by: Sai, Ryuichi, et al.
Published: (2023)

Persistent and Partitioned MPI for Stencil Communication
by: Collom, Gerald, et al.
Published: (2025)

An Adaptive Distributed Stencil Abstraction for GPUs
by: Bhosale, Aditya, et al.
Published: (2025)

To Repair or Not to Repair: Assessing Fault Resilience in MPI Stencil Applications
by: Rocco, Roberto, et al.
Published: (2024)

QPU Micro-Kernels for Stencil Computation
by: Markidis, Stefano, et al.
Published: (2025)

MMStencil: Optimizing High-order Stencils on Multicore CPU using Matrix Unit
by: Wang, Yinuo, et al.
Published: (2025)

High Performance Unstructured SpMM Computation Using Tensor Cores
by: Okanovic, Patrik, et al.
Published: (2024)

HotSwap: Enabling Live Dependency Sharing in Serverless Computing
by: Li, Rui, et al.
Published: (2024)

ZeroPP: Unleashing Exceptional Parallelism Efficiency through Tensor-Parallelism-Free Methodology
by: Tang, Ding, et al.
Published: (2024)

Fused3S: Fast Sparse Attention on Tensor Cores
by: Li, Zitong, et al.
Published: (2025)

Generalized Compare and Swap
by: Hadzilacos, Vassos, et al.
Published: (2024)

KUBEDIRECT: Unleashing the Full Power of the Cluster Manager for Serverless Computing
by: Qi, Sheng, et al.
Published: (2026)

cuFastTuckerPlus: A Stochastic Parallel Sparse FastTucker Decomposition Using GPU Tensor Cores
by: Li, Zixuan, et al.
Published: (2024)

Accelerating Sparse MTTKRP for Small Tensor Decomposition on GPU
by: Wijeratne, Sasindu, et al.
Published: (2025)

Lagom: Unleashing the Power of Communication and Computation Overlapping for Distributed LLM Training
by: Xu, Guanbin, et al.
Published: (2026)

Cyclic Data Streaming on GPUs for Short Range Stencils Applied to Molecular Dynamics
by: Rose, Martin, et al.
Published: (2025)

PRISM: Processing-In-Memory Sparse MTTKRP for Tensor Decomposition Acceleration
by: Pacheco, Daniel, et al.
Published: (2026)

Stencil Computations on Cerebras Wafer-Scale Engine
by: Belli, Elia, et al.
Published: (2026)

Unleashing Collaborative Computing for Adaptive Video Streaming with Multi-objective Optimization in Satellite Terrestrial Networks
by: Shen, Zhishu, et al.
Published: (2024)

Accelerating Drug Discovery in AutoDock-GPU with Tensor Cores
by: Schieffer, Gabin, et al.
Published: (2024)

AMPED: Accelerating MTTKRP for Billion-Scale Sparse Tensor Decomposition on Multiple GPUs
by: Wijeratne, Sasindu, et al.
Published: (2025)

PeerSwap: A Peer-Sampler with Randomness Guarantees
by: Guerraoui, Rachid, et al.
Published: (2024)

ARM SVE Unleashed: Performance and Insights Across HPC Applications on Nvidia Grace
by: Shi, Ruimin, et al.
Published: (2025)

FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores
by: Shi, Jinliang, et al.
Published: (2024)

Unleashing Scalable Context Parallelism for Foundation Models Pre-Training via FCP
by: Zhao, Yilong, et al.
Published: (2026)

Unleashing Efficient Asynchronous RL Post-Training via Staleness-Constrained Rollout Coordination
by: Li, Haoyang, et al.
Published: (2026)

Unleashing Multicore Strength for Efficient Execution of Transactions
by: Ravish, Ankit, et al.
Published: (2024)

FlashFuser: Expanding the Scale of Kernel Fusion for Compute-Intensive Operators via Inter-Core Connection
by: Huang, Ziyu, et al.
Published: (2025)

Predictive Performance of Photonic SRAM-based In-Memory Computing for Tensor Decomposition
by: Wijeratne, Sasindu, et al.
Published: (2025)

Minimizing Communication for Parallel Symmetric Tensor Times Same Vector Computation
by: Daas, Hussam Al, et al.
Published: (2025)

An MLIR Lowering Pipeline for Stencils at Wafer-Scale
by: Stawinoga, Nicolai, et al.
Published: (2026)

Generalized Compare-and-Swap and Space-Efficient Universal Constructions for the Infinite-Arrival Model
by: Hadzilacos, Vassos, et al.
Published: (2026)

Can Tensor Cores Benefit Memory-Bound Kernels? (No!)
by: Zhang, Lingqi, et al.
Published: (2025)

Sparse MTTKRP Acceleration for Tensor Decomposition on GPU
by: Wijeratne, Sasindu, et al.
Published: (2024)

Guaranteed DGEMM Accuracy While Using Reduced Precision Tensor Cores Through Extensions of the Ozaki Scheme
by: Schwarz, Angelika, et al.
Published: (2025)