| Main Authors: | Gu, Qiqi; Wu, Chenpeng; Shi, Heng; Yao, Jianguo |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2506.22035 |
Similar Items
Do We Need Tensor Cores for Stencil Computations?
by: Gu, Qiqi, et al.
Published: (2026)
Samoyeds: Accelerating MoE Models with Structured Sparsity Leveraging Sparse Tensor Cores
by: Wu, Chenpeng, et al.
Published: (2025)
Stencil Matrixization
by: Zhao, Wenxuan, et al.
Published: (2023)
Stencil Computations on Tenstorrent Wormhole
by: Piarulli, Lorenzo, et al.
Published: (2026)
Evaluation of Programming Models and Performance for Stencil Computation on Current GPU Architectures
by: Shan, Baodi, et al.
Published: (2024)
A Portable Framework for Accelerating Stencil Computations on Modern Node Architectures
by: Sai, Ryuichi, et al.
Published: (2023)
Persistent and Partitioned MPI for Stencil Communication
by: Collom, Gerald, et al.
Published: (2025)
An Adaptive Distributed Stencil Abstraction for GPUs
by: Bhosale, Aditya, et al.
Published: (2025)
To Repair or Not to Repair: Assessing Fault Resilience in MPI Stencil Applications
by: Rocco, Roberto, et al.
Published: (2024)
QPU Micro-Kernels for Stencil Computation
by: Markidis, Stefano, et al.
Published: (2025)
MMStencil: Optimizing High-order Stencils on Multicore CPU using Matrix Unit
by: Wang, Yinuo, et al.
Published: (2025)
High Performance Unstructured SpMM Computation Using Tensor Cores
by: Okanovic, Patrik, et al.
Published: (2024)
HotSwap: Enabling Live Dependency Sharing in Serverless Computing
by: Li, Rui, et al.
Published: (2024)
ZeroPP: Unleashing Exceptional Parallelism Efficiency through Tensor-Parallelism-Free Methodology
by: Tang, Ding, et al.
Published: (2024)
Fused3S: Fast Sparse Attention on Tensor Cores
by: Li, Zitong, et al.
Published: (2025)
Generalized Compare and Swap
by: Hadzilacos, Vassos, et al.
Published: (2024)
KUBEDIRECT: Unleashing the Full Power of the Cluster Manager for Serverless Computing
by: Qi, Sheng, et al.
Published: (2026)
cuFastTuckerPlus: A Stochastic Parallel Sparse FastTucker Decomposition Using GPU Tensor Cores
by: Li, Zixuan, et al.
Published: (2024)
Accelerating Sparse MTTKRP for Small Tensor Decomposition on GPU
by: Wijeratne, Sasindu, et al.
Published: (2025)
Lagom: Unleashing the Power of Communication and Computation Overlapping for Distributed LLM Training
by: Xu, Guanbin, et al.
Published: (2026)
Cyclic Data Streaming on GPUs for Short Range Stencils Applied to Molecular Dynamics
by: Rose, Martin, et al.
Published: (2025)
PRISM: Processing-In-Memory Sparse MTTKRP for Tensor Decomposition Acceleration
by: Pacheco, Daniel, et al.
Published: (2026)
Stencil Computations on Cerebras Wafer-Scale Engine
by: Belli, Elia, et al.
Published: (2026)
Unleashing Collaborative Computing for Adaptive Video Streaming with Multi-objective Optimization in Satellite Terrestrial Networks
by: Shen, Zhishu, et al.
Published: (2024)
Accelerating Drug Discovery in AutoDock-GPU with Tensor Cores
by: Schieffer, Gabin, et al.
Published: (2024)
AMPED: Accelerating MTTKRP for Billion-Scale Sparse Tensor Decomposition on Multiple GPUs
by: Wijeratne, Sasindu, et al.
Published: (2025)
PeerSwap: A Peer-Sampler with Randomness Guarantees
by: Guerraoui, Rachid, et al.
Published: (2024)
ARM SVE Unleashed: Performance and Insights Across HPC Applications on Nvidia Grace
by: Shi, Ruimin, et al.
Published: (2025)
FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores
by: Shi, Jinliang, et al.
Published: (2024)
Unleashing Scalable Context Parallelism for Foundation Models Pre-Training via FCP
by: Zhao, Yilong, et al.
Published: (2026)
Unleashing Efficient Asynchronous RL Post-Training via Staleness-Constrained Rollout Coordination
by: Li, Haoyang, et al.
Published: (2026)
Unleashing Multicore Strength for Efficient Execution of Transactions
by: Ravish, Ankit, et al.
Published: (2024)
FlashFuser: Expanding the Scale of Kernel Fusion for Compute-Intensive Operators via Inter-Core Connection
by: Huang, Ziyu, et al.
Published: (2025)
Predictive Performance of Photonic SRAM-based In-Memory Computing for Tensor Decomposition
by: Wijeratne, Sasindu, et al.
Published: (2025)
Minimizing Communication for Parallel Symmetric Tensor Times Same Vector Computation
by: Daas, Hussam Al, et al.
Published: (2025)
An MLIR Lowering Pipeline for Stencils at Wafer-Scale
by: Stawinoga, Nicolai, et al.
Published: (2026)
Generalized Compare-and-Swap and Space-Efficient Universal Constructions for the Infinite-Arrival Model
by: Hadzilacos, Vassos, et al.
Published: (2026)
Can Tensor Cores Benefit Memory-Bound Kernels? (No!)
by: Zhang, Lingqi, et al.
Published: (2025)
Sparse MTTKRP Acceleration for Tensor Decomposition on GPU
by: Wijeratne, Sasindu, et al.
Published: (2024)
Guaranteed DGEMM Accuracy While Using Reduced Precision Tensor Cores Through Extensions of the Ozaki Scheme
by: Schwarz, Angelika, et al.
Published: (2025)