:: Library Catalog

Bibliographic Details
Main Authors:	Mukherjee, Soutrik; Cha, Sangwhan
Format:	Preprint
Published:	2026
Subjects:	Machine Learning; Distributed, Parallel, and Cluster Computing
Online Access:	https://arxiv.org/abs/2603.28708

Similar Items

Big Data-Driven Fraud Detection Using Machine Learning and Real-Time Stream Processing
by: Liu, Chen, et al.
Published: (2025)

Developing a Blockchain-Based Secure Digital Contents Distribution System
by: Qadri, Syed Mohiuddin, et al.
Published: (2025)

Parallel Track Transformers: Enabling Fast GPU Inference with Reduced Synchronization
by: Wang, Chong, et al.
Published: (2026)

Accelerating Mobile Inference through Fine-Grained CPU-GPU Co-Execution
by: Li, Zhuojin, et al.
Published: (2025)

FIKIT: Priority-Based Real-time GPU Multi-tasking Scheduling with Kernel Identification
by: Wu, Wenqing
Published: (2023)

Floe: Federated Specialization for Real-Time LLM-SLM Inference
by: Tian, Chunlin, et al.
Published: (2026)

GeoT: Tensor Centric Library for Graph Neural Network via Efficient Segment Reduction on GPU
by: Yu, Zhongming, et al.
Published: (2024)

Challenging GPU Dominance: When CPUs Outperform for On-Device LLM Inference
by: Zhang, Haolin, et al.
Published: (2025)

Accelerate Intermittent Deep Inference
by: Zhang, Ziliang
Published: (2024)

MoE-Gen: High-Throughput MoE Inference on a Single GPU with Module-Based Batching
by: Xu, Tairan, et al.
Published: (2025)

RAPID-Serve: Resource-efficient and Accelerated P/D Intra-GPU Disaggregation
by: Masood, Amna, et al.
Published: (2026)

FloatSOM: GPU-Accelerated, Distributed, Topology-Flexible Self-Organizing Maps
by: Xu, Tony, et al.
Published: (2026)

Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference
by: Recasens, Pol G., et al.
Published: (2025)

Occult: Optimizing Collaborative Communication across Experts for Accelerated Parallel MoE Training and Inference
by: Luo, Shuqing, et al.
Published: (2025)

Enabling Disaggregated Multi-Stage MLLM Inference via GPU-Internal Scheduling and Resource Sharing
by: Zhao, Lingxiao, et al.
Published: (2025)

TensAIR: Real-Time Training of Neural Networks from Data-streams
by: Tosi, Mauro D. L., et al.
Published: (2022)

Revati: Transparent GPU-Free Time-Warp Emulation for LLM Serving
by: Agrawal, Amey, et al.
Published: (2026)

LLM Inference at the Edge: Mobile, NPU, and GPU Performance Efficiency Trade-offs Under Sustained Load
by: Tummalapalli, Pranay, et al.
Published: (2026)

HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference
by: Zhong, Shuzhang, et al.
Published: (2025)

Maya: Optimizing Deep Learning Training Workloads using GPU Runtime Emulation
by: Yarlagadda, Srihas, et al.
Published: (2025)

Characterizing WebGPU Dispatch Overhead for LLM Inference Across Four GPU Vendors, Three Backends, and Three Browsers
by: Maczan, Jędrzej
Published: (2026)

Distributed client selection with multi-objective in federated learning assisted Internet of Vehicles
by: Cha, Narisu, et al.
Published: (2024)

FlashMem: Supporting Modern DNN Workloads on Mobile with GPU Memory Hierarchy Optimizations
by: Shu, Zhihao, et al.
Published: (2026)

Systematic Optimization of Real-Time Diffusion Model Inference on Apple M3 Ultra
by: Ochiai, Yoichi
Published: (2026)

70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float (DFloat11)
by: Zhang, Tianyi, et al.
Published: (2025)

Fast and Efficient 2-bit LLM Inference on GPU: 2/4/16-bit in a Weight Matrix with Asynchronous Dequantization
by: Li, Jinhao, et al.
Published: (2023)

Optimizing Federated Learning using Remote Embeddings for Graph Neural Networks
by: Naman, Pranjal, et al.
Published: (2025)

Quasar: Quantized Self-Speculative Acceleration for Rapid Inference via Memory-Efficient Verification
by: Huang, Guang, et al.
Published: (2026)

HACK: Homomorphic Acceleration via Compression of the Key-Value Cache for Disaggregated LLM Inference
by: Zhang, Zeyu, et al.
Published: (2025)

Split CNN Inference on Networked Microcontrollers
by: Lu, Junyu, et al.
Published: (2026)

LSM-GNN: Large-scale Storage-based Multi-GPU GNN Training by Optimizing Data Transfer Scheme
by: Park, Jeongmin Brian, et al.
Published: (2024)

Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference
by: Prabhakar, Rohan Baskar, et al.
Published: (2024)

GPU Cluster Scheduling for Network-Sensitive Deep Learning
by: Sharma, Aakash, et al.
Published: (2024)

Distributed Graph Neural Network Inference With Just-In-Time Compilation For Industry-Scale Graphs
by: Wu, Xiabao, et al.
Published: (2025)

GPU-Accelerated Synthesis of Mixed-Boolean Arithmetic: Beyond Caching
by: Bathie, Gabriel, et al.
Published: (2026)

Single-GPU GNN Systems: Traps and Pitfalls
by: Gong, Yidong, et al.
Published: (2024)

FastGL: A GPU-Efficient Framework for Accelerating Sampling-Based GNN Training at Large Scale
by: Zhu, Zeyu, et al.
Published: (2024)

BLaST: High Performance Inference and Pretraining using BLock Sparse Transformers
by: Okanovic, Patrik, et al.
Published: (2025)

SYMI: Efficient Mixture-of-Experts Training via Model and Optimizer State Decoupling
by: Skiadopoulos, Athinagoras, et al.
Published: (2025)

Priority-Aware Model-Distributed Inference at Edge Networks
by: Li, Teng, et al.
Published: (2024)