| Main Authors: | Wang, Hansheng, Shi, Lu, Duan, Zhekai, Wu, Panruo, Guo, Liwei, Zhang, Shaoshuai |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2410.02170 |
Similar Items
Pipelined Dense Symmetric Eigenvalue Decomposition on Multi-GPU Architectures
by: Wang, Hansheng, et al.
Published: (2025)
Pipelet: Practical Streamlined Blockchain Protocol
by: Karihaloo, Vivek, et al.
Published: (2024)
Accelerating Sparse MTTKRP for Small Tensor Decomposition on GPU
by: Wijeratne, Sasindu, et al.
Published: (2025)
Gaia: Hybrid Hardware Acceleration for Serverless AI in the 3D Compute Continuum
by: Reisecker, Maximilian, et al.
Published: (2025)
PRISM: Processing-In-Memory Sparse MTTKRP for Tensor Decomposition Acceleration
by: Pacheco, Daniel, et al.
Published: (2026)
SpArch: Efficient Architecture for Sparse Matrix Multiplication
by: Zhang, Zhekai, et al.
Published: (2020)
AMPED: Accelerating MTTKRP for Billion-Scale Sparse Tensor Decomposition on Multiple GPUs
by: Wijeratne, Sasindu, et al.
Published: (2025)
Communication-Efficient Model Aggregation with Layer Divergence Feedback in Federated Learning
by: Wang, Liwei, et al.
Published: (2024)
Experimental Evaluation of Distributed k-Core Decomposition
by: Guo, Bin, et al.
Published: (2024)
HexiSeq: Accommodating Long Context Training of LLMs over Heterogeneous Hardware
by: Liang, Yan, et al.
Published: (2026)
CCSS: Hardware-Accelerated RTL Simulation with Fast Combinational Logic Computing and Sequential Logic Synchronization
by: Feng, Weigang, et al.
Published: (2025)
Revealing the Challenges of Attention-FFN Disaggregation for Modern MoE Models and Hardware Systems
by: Liu, Guowei, et al.
Published: (2026)
Federated k-Core Decomposition: A Secure Distributed Approach
by: Guo, Bin, et al.
Published: (2024)
Exploiting Multicast for Accelerating Collective Communication
by: Xu, Chao, et al.
Published: (2026)
Hardware-Agnostic and Insightful Efficiency Metrics for Accelerated Systems: Definition and Implementation within TALP
by: Rahimi, Ghazal, et al.
Published: (2026)
MoE-Hub: Taming Software Complexity for Seamless MoE Overlap with Hardware-Accelerated Communication on Multi-GPU Systems
by: Zhou, Zhuoshan, et al.
Published: (2026)
GPU-Accelerated Batch-Dynamic Subgraph Matching
by: Qiu, Linshan, et al.
Published: (2024)
TokenSim: Enabling Hardware and Software Exploration for Large Language Model Inference Systems
by: Wu, Feiyang, et al.
Published: (2025)
Leveraging Hardware-Aware Computation in Mixed-Precision Matrix Multiply: A Tile-Centric Approach
by: Zhang, Qiao, et al.
Published: (2025)
Accelerating Biclique Counting on GPU
by: Qiu, Linshan, et al.
Published: (2024)
Investigating Sharding Advancements, Methodologies, and Adoption Potential in Hedera
by: Wang, Ziwei, et al.
Published: (2025)
Sparse MTTKRP Acceleration for Tensor Decomposition on GPU
by: Wijeratne, Sasindu, et al.
Published: (2024)
Accelerating Sparse DNNs Based on Tiled GEMM
by: Guo, Cong, et al.
Published: (2024)
Accelerating OpenPangu Inference on NPU via Speculative Decoding
by: Dai, Yuntao, et al.
Published: (2026)
Accelerating Edge Inference for Distributed MoE Models with Latency-Optimized Expert Placement
by: Wu, Tian, et al.
Published: (2025)
Federated Learning Using Coupled Tensor Train Decomposition
by: Zhang, Xiangtao, et al.
Published: (2024)
SP-MoE: Speculative Decoding and Prefetching for Accelerating MoE-based Model Inference
by: Chen, Liangkun, et al.
Published: (2025)
From Symmetric to Asymmetric Asynchronous Byzantine Consensus
by: Cachin, Christian, et al.
Published: (2020)
Workload-Aware Hardware Accelerator Mining for Distributed Deep Learning Training
by: Adnan, Muhammad, et al.
Published: (2024)
ReviveMoE: Fast Recovery for Hardware Failures in Large-Scale MoE LLM Inference Deployments
by: Li, Haley, et al.
Published: (2026)
HMTRace: Hardware-Assisted Memory-Tagging based Dynamic Data Race Detection
by: Shastri, Jaidev, et al.
Published: (2024)
Towards Energy-Efficient Serverless Computing with Hardware Isolation
by: Carl, Natalie, et al.
Published: (2025)
gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters
by: Huang, Jiajun, et al.
Published: (2023)
GPZ: GPU-Accelerated Lossy Compressor for Particle Data
by: Li, Ruoyu, et al.
Published: (2025)
Communication Lower Bounds and Optimal Algorithms for Symmetric Matrix Computations
by: Daas, Hussam Al, et al.
Published: (2024)
Vortex: Efficient Sample-Free Dynamic Tensor Program Optimization via Hardware-aware Strategy Space Hierarchization
by: Zhou, Yangjie, et al.
Published: (2024)
SCARIF: Towards Carbon Modeling of Cloud Servers with Accelerators
by: Ji, Shixin, et al.
Published: (2024)
Enhancing ASIC Technology Mapping via Parallel Supergate Computing
by: Cai, Ye, et al.
Published: (2024)
Accelerating Heterogeneous Tensor Parallelism via Flexible Workload Control
by: Wang, Zhigang, et al.
Published: (2024)
Benchmarking Compound AI Applications for Hardware-Software Co-Design
by: Samuthrsindh, Paramuth, et al.
Published: (2026)