| Main Authors: | Mukherjee, Soutrik, Cha, Sangwhan |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2603.28708 |
Similar Items
Big Data-Driven Fraud Detection Using Machine Learning and Real-Time Stream Processing
by: Liu, Chen, et al.
Published: (2025)
Developing a Blockchain-Based Secure Digital Contents Distribution System
by: Qadri, Syed Mohiuddin, et al.
Published: (2025)
Parallel Track Transformers: Enabling Fast GPU Inference with Reduced Synchronization
by: Wang, Chong, et al.
Published: (2026)
Accelerating Mobile Inference through Fine-Grained CPU-GPU Co-Execution
by: Li, Zhuojin, et al.
Published: (2025)
FIKIT: Priority-Based Real-time GPU Multi-tasking Scheduling with Kernel Identification
by: Wu, Wenqing
Published: (2023)
Floe: Federated Specialization for Real-Time LLM-SLM Inference
by: Tian, Chunlin, et al.
Published: (2026)
GeoT: Tensor Centric Library for Graph Neural Network via Efficient Segment Reduction on GPU
by: Yu, Zhongming, et al.
Published: (2024)
Challenging GPU Dominance: When CPUs Outperform for On-Device LLM Inference
by: Zhang, Haolin, et al.
Published: (2025)
Accelerate Intermittent Deep Inference
by: Zhang, Ziliang
Published: (2024)
MoE-Gen: High-Throughput MoE Inference on a Single GPU with Module-Based Batching
by: Xu, Tairan, et al.
Published: (2025)
RAPID-Serve: Resource-efficient and Accelerated P/D Intra-GPU Disaggregation
by: Masood, Amna, et al.
Published: (2026)
FloatSOM: GPU-Accelerated, Distributed, Topology-Flexible Self-Organizing Maps
by: Xu, Tony, et al.
Published: (2026)
Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference
by: Recasens, Pol G., et al.
Published: (2025)
Occult: Optimizing Collaborative Communication across Experts for Accelerated Parallel MoE Training and Inference
by: Luo, Shuqing, et al.
Published: (2025)
Enabling Disaggregated Multi-Stage MLLM Inference via GPU-Internal Scheduling and Resource Sharing
by: Zhao, Lingxiao, et al.
Published: (2025)
TensAIR: Real-Time Training of Neural Networks from Data-streams
by: Tosi, Mauro D. L., et al.
Published: (2022)
Revati: Transparent GPU-Free Time-Warp Emulation for LLM Serving
by: Agrawal, Amey, et al.
Published: (2026)
LLM Inference at the Edge: Mobile, NPU, and GPU Performance Efficiency Trade-offs Under Sustained Load
by: Tummalapalli, Pranay, et al.
Published: (2026)
HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference
by: Zhong, Shuzhang, et al.
Published: (2025)
Maya: Optimizing Deep Learning Training Workloads using GPU Runtime Emulation
by: Yarlagadda, Srihas, et al.
Published: (2025)
Characterizing WebGPU Dispatch Overhead for LLM Inference Across Four GPU Vendors, Three Backends, and Three Browsers
by: Maczan, Jędrzej
Published: (2026)
Distributed client selection with multi-objective in federated learning assisted Internet of Vehicles
by: Cha, Narisu, et al.
Published: (2024)
FlashMem: Supporting Modern DNN Workloads on Mobile with GPU Memory Hierarchy Optimizations
by: Shu, Zhihao, et al.
Published: (2026)
Systematic Optimization of Real-Time Diffusion Model Inference on Apple M3 Ultra
by: Ochiai, Yoichi
Published: (2026)
70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float (DFloat11)
by: Zhang, Tianyi, et al.
Published: (2025)
Fast and Efficient 2-bit LLM Inference on GPU: 2/4/16-bit in a Weight Matrix with Asynchronous Dequantization
by: Li, Jinhao, et al.
Published: (2023)
Optimizing Federated Learning using Remote Embeddings for Graph Neural Networks
by: Naman, Pranjal, et al.
Published: (2025)
Quasar: Quantized Self-Speculative Acceleration for Rapid Inference via Memory-Efficient Verification
by: Huang, Guang, et al.
Published: (2026)
HACK: Homomorphic Acceleration via Compression of the Key-Value Cache for Disaggregated LLM Inference
by: Zhang, Zeyu, et al.
Published: (2025)
Split CNN Inference on Networked Microcontrollers
by: Lu, Junyu, et al.
Published: (2026)
LSM-GNN: Large-scale Storage-based Multi-GPU GNN Training by Optimizing Data Transfer Scheme
by: Park, Jeongmin Brian, et al.
Published: (2024)
Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference
by: Prabhakar, Rohan Baskar, et al.
Published: (2024)
GPU Cluster Scheduling for Network-Sensitive Deep Learning
by: Sharma, Aakash, et al.
Published: (2024)
Distributed Graph Neural Network Inference With Just-In-Time Compilation For Industry-Scale Graphs
by: Wu, Xiabao, et al.
Published: (2025)
GPU-Accelerated Synthesis of Mixed-Boolean Arithmetic: Beyond Caching
by: Bathie, Gabriel, et al.
Published: (2026)
Single-GPU GNN Systems: Traps and Pitfalls
by: Gong, Yidong, et al.
Published: (2024)
FastGL: A GPU-Efficient Framework for Accelerating Sampling-Based GNN Training at Large Scale
by: Zhu, Zeyu, et al.
Published: (2024)
BLaST: High Performance Inference and Pretraining using BLock Sparse Transformers
by: Okanovic, Patrik, et al.
Published: (2025)
SYMI: Efficient Mixture-of-Experts Training via Model and Optimizer State Decoupling
by: Skiadopoulos, Athinagoras, et al.
Published: (2025)
Priority-Aware Model-Distributed Inference at Edge Networks
by: Li, Teng, et al.
Published: (2024)