:: Library Catalog

Bibliographic Details
Main Authors:	Wang, Hansheng, Shi, Lu, Duan, Zhekai, Wu, Panruo, Guo, Liwei, Zhang, Shaoshuai
Format:	Preprint
Published:	2024
Subjects:	Distributed, Parallel, and Cluster Computing
Online Access:	https://arxiv.org/abs/2410.02170

Similar Items

Pipelined Dense Symmetric Eigenvalue Decomposition on Multi-GPU Architectures
by: Wang, Hansheng, et al.
Published: (2025)

Pipelet: Practical Streamlined Blockchain Protocol
by: Karihaloo, Vivek, et al.
Published: (2024)

Accelerating Sparse MTTKRP for Small Tensor Decomposition on GPU
by: Wijeratne, Sasindu, et al.
Published: (2025)

Gaia: Hybrid Hardware Acceleration for Serverless AI in the 3D Compute Continuum
by: Reisecker, Maximilian, et al.
Published: (2025)

PRISM: Processing-In-Memory Sparse MTTKRP for Tensor Decomposition Acceleration
by: Pacheco, Daniel, et al.
Published: (2026)

SpArch: Efficient Architecture for Sparse Matrix Multiplication
by: Zhang, Zhekai, et al.
Published: (2020)

AMPED: Accelerating MTTKRP for Billion-Scale Sparse Tensor Decomposition on Multiple GPUs
by: Wijeratne, Sasindu, et al.
Published: (2025)

Communication-Efficient Model Aggregation with Layer Divergence Feedback in Federated Learning
by: Wang, Liwei, et al.
Published: (2024)

Experimental Evaluation of Distributed k-Core Decomposition
by: Guo, Bin, et al.
Published: (2024)

HexiSeq: Accommodating Long Context Training of LLMs over Heterogeneous Hardware
by: Liang, Yan, et al.
Published: (2026)

CCSS: Hardware-Accelerated RTL Simulation with Fast Combinational Logic Computing and Sequential Logic Synchronization
by: Feng, Weigang, et al.
Published: (2025)

Revealing the Challenges of Attention-FFN Disaggregation for Modern MoE Models and Hardware Systems
by: Liu, Guowei, et al.
Published: (2026)

Federated k-Core Decomposition: A Secure Distributed Approach
by: Guo, Bin, et al.
Published: (2024)

Exploiting Multicast for Accelerating Collective Communication
by: Xu, Chao, et al.
Published: (2026)

Hardware-Agnostic and Insightful Efficiency Metrics for Accelerated Systems: Definition and Implementation within TALP
by: Rahimi, Ghazal, et al.
Published: (2026)

MoE-Hub: Taming Software Complexity for Seamless MoE Overlap with Hardware-Accelerated Communication on Multi-GPU Systems
by: Zhou, Zhuoshan, et al.
Published: (2026)

GPU-Accelerated Batch-Dynamic Subgraph Matching
by: Qiu, Linshan, et al.
Published: (2024)

TokenSim: Enabling Hardware and Software Exploration for Large Language Model Inference Systems
by: Wu, Feiyang, et al.
Published: (2025)

Leveraging Hardware-Aware Computation in Mixed-Precision Matrix Multiply: A Tile-Centric Approach
by: Zhang, Qiao, et al.
Published: (2025)

Accelerating Biclique Counting on GPU
by: Qiu, Linshan, et al.
Published: (2024)

Investigating Sharding Advancements, Methodologies, and Adoption Potential in Hedera
by: Wang, Ziwei, et al.
Published: (2025)

Sparse MTTKRP Acceleration for Tensor Decomposition on GPU
by: Wijeratne, Sasindu, et al.
Published: (2024)

Accelerating Sparse DNNs Based on Tiled GEMM
by: Guo, Cong, et al.
Published: (2024)

Accelerating OpenPangu Inference on NPU via Speculative Decoding
by: Dai, Yuntao, et al.
Published: (2026)

Accelerating Edge Inference for Distributed MoE Models with Latency-Optimized Expert Placement
by: Wu, Tian, et al.
Published: (2025)

Federated Learning Using Coupled Tensor Train Decomposition
by: Zhang, Xiangtao, et al.
Published: (2024)

SP-MoE: Speculative Decoding and Prefetching for Accelerating MoE-based Model Inference
by: Chen, Liangkun, et al.
Published: (2025)

From Symmetric to Asymmetric Asynchronous Byzantine Consensus
by: Cachin, Christian, et al.
Published: (2020)

Workload-Aware Hardware Accelerator Mining for Distributed Deep Learning Training
by: Adnan, Muhammad, et al.
Published: (2024)

ReviveMoE: Fast Recovery for Hardware Failures in Large-Scale MoE LLM Inference Deployments
by: Li, Haley, et al.
Published: (2026)

HMTRace: Hardware-Assisted Memory-Tagging based Dynamic Data Race Detection
by: Shastri, Jaidev, et al.
Published: (2024)

Towards Energy-Efficient Serverless Computing with Hardware Isolation
by: Carl, Natalie, et al.
Published: (2025)

gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters
by: Huang, Jiajun, et al.
Published: (2023)

GPZ: GPU-Accelerated Lossy Compressor for Particle Data
by: Li, Ruoyu, et al.
Published: (2025)

Communication Lower Bounds and Optimal Algorithms for Symmetric Matrix Computations
by: Daas, Hussam Al, et al.
Published: (2024)

Vortex: Efficient Sample-Free Dynamic Tensor Program Optimization via Hardware-aware Strategy Space Hierarchization
by: Zhou, Yangjie, et al.
Published: (2024)

SCARIF: Towards Carbon Modeling of Cloud Servers with Accelerators
by: Ji, Shixin, et al.
Published: (2024)

Enhancing ASIC Technology Mapping via Parallel Supergate Computing
by: Cai, Ye, et al.
Published: (2024)

Accelerating Heterogeneous Tensor Parallelism via Flexible Workload Control
by: Wang, Zhigang, et al.
Published: (2024)

Benchmarking Compound AI Applications for Hardware-Software Co-Design
by: Samuthrsindh, Paramuth, et al.
Published: (2026)