| Main Authors: | Pandey, Santosh; Wang, Zhibin; Zhong, Sheng; Tian, Chen; Zheng, Bolong; Li, Xiaoye; Li, Lingda; Hoisie, Adolfy; Ding, Caiwen; Li, Dong; Liu, Hang |
|---|---|
| Format: | Preprint |
| Published: | 2021 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2103.08053 |
Similar Items
RTop-K: Ultra-Fast Row-Wise Top-K Selection for Neural Network Acceleration on GPUs
by: Xie, Xi, et al.
Published: (2024)
Managing Multi Instance GPUs for High Throughput and Energy Savings
by: Saraha, Abhijeet, et al.
Published: (2025)
FPTC: A Fast Parallel Transform-based Codec for Efficient Asymmetric Signal Compression
by: Mechels, Ben, et al.
Published: (2026)
Evaluating Emerging AI/ML Accelerators: IPU, RDU, and NVIDIA/AMD GPUs
by: Peng, Hongwu, et al.
Published: (2023)
Agent-Based Triangle Counting: Unlocking Truss Decomposition, Triangle Centrality, and Local Clustering Coefficient
by: Chand, Prabhat Kumar, et al.
Published: (2024)
cuSZ-$i$: High-Ratio Scientific Lossy Compression on GPUs with Optimized Multi-Level Interpolation
by: Liu, Jinyang, et al.
Published: (2023)
Accelerating Triangle Counting with Real Processing-in-Memory Systems
by: Asquini, Lorenzo, et al.
Published: (2025)
Adaptive Cache Management for Complex Storage Systems Using CNN-LSTM-Based Spatiotemporal Prediction
by: Wang, Xiaoye, et al.
Published: (2024)
Bingo: Radix-based Bias Factorization for Random Walk on Dynamic Graphs
by: Wang, Pinhuan, et al.
Published: (2025)
ParamSpMM: Adaptive and Efficient Sparse Matrix-Matrix Multiplication on GPUs for GNNs
by: Zhang, Lixing, et al.
Published: (2026)
How to Rent GPUs on a Budget
by: Li, Zhouzi, et al.
Published: (2024)
Scalable Graph Indexing using GPUs for Approximate Nearest Neighbor Search
by: Li, Zhonggen, et al.
Published: (2025)
RSH-SpMM: A Row-Structured Hybrid Kernel for Sparse Matrix-Matrix Multiplication on GPUs
by: Li, Aiying, et al.
Published: (2026)
HP-MDR: High-performance and Portable Data Refactoring and Progressive Retrieval with Advanced GPUs
by: Li, Yanliang, et al.
Published: (2025)
Opara: Exploiting Operator Parallelism for Expediting DNN Inference on GPUs
by: Chen, Aodong, et al.
Published: (2023)
APEX: Asynchronous Parallel CPU-GPU Execution for Online LLM Inference on Constrained GPUs
by: Fan, Jiakun, et al.
Published: (2025)
Accelerating Sparse Matrix-Matrix Multiplication on GPUs with Processing Near HBMs
by: Li, Shiju, et al.
Published: (2025)
Distributed Triangle Enumeration in Hypergraphs
by: Adamson, Duncan, et al.
Published: (2026)
BOA Constrictor: Squeezing Performance out of GPUs in the Cloud via Budget-Optimal Allocation
by: Li, Zhouzi, et al.
Published: (2026)
Improved Massively Parallel Triangle Counting in $O(1)$ Rounds
by: Liu, Quanquan C., et al.
Published: (2024)
AI Surrogate Model for Distributed Computing Workloads
by: Park, David K., et al.
Published: (2024)
Alternative Mixed Integer Linear Programming Optimization for Joint Job Scheduling and Data Allocation in Grid Computing
by: Feng, Shengyu, et al.
Published: (2025)
An Adaptive Distributed Stencil Abstraction for GPUs
by: Bhosale, Aditya, et al.
Published: (2025)
Accelerating Maximal Biclique Enumeration on GPUs
by: Hsieh, Chou-Ying, et al.
Published: (2024)
Parallelizing Maximal Clique Enumeration on GPUs
by: Almasri, Mohammad, et al.
Published: (2022)
Optimizing sDTW for AMD GPUs
by: Latta-Lin, Daniel, et al.
Published: (2024)
Accelerating Biclique Counting on GPU
by: Qiu, Linshan, et al.
Published: (2024)
Accelerating Mixture-of-Experts Inference by Hiding Offloading Latency with Speculative Decoding
by: Wang, Zhibin, et al.
Published: (2025)
TurboFFT: Co-Designed High-Performance and Fault-Tolerant Fast Fourier Transform on GPUs
by: Wu, Shixun, et al.
Published: (2024)
STAR: Decode-Phase Rescheduling for LLM Inference
by: Wang, Zhibin, et al.
Published: (2025)
FlashMP: Fast Discrete Transform-Based Solver for Preconditioning Maxwell's Equations on GPUs
by: Zhang, Haoyuan, et al.
Published: (2025)
Serving Compound Inference Systems on Datacenter GPUs
by: Devata, Sriram, et al.
Published: (2026)
Fast Kronecker Matrix-Matrix Multiplication on GPUs
by: Jangda, Abhinav, et al.
Published: (2024)
Optimal Workload Placement on Multi-Instance GPUs
by: Turkkan, Bekir, et al.
Published: (2024)
ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL
by: Gao, Wei, et al.
Published: (2026)
FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion
by: Chang, Li-Wen, et al.
Published: (2024)
Dynamic Scheduling Strategies for Resource Optimization in Computing Environments
by: Wang, Xiaoye
Published: (2024)
Straggler Tolerant and Resilient DL Training on Homogeneous GPUs
by: Zhang, Zeyu, et al.
Published: (2025)
RDMA-Based Algorithms for Sparse Matrix Multiplication on GPUs
by: Brock, Benjamin, et al.
Published: (2023)
Accurate Computation of the Logarithm of Modified Bessel Functions on GPUs
by: Plesner, Andreas, et al.
Published: (2024)