| Main Authors: | Lei, Zhenyu, Hao, Jin-Kao, Wu, Qinghua |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2506.17357 |
Similar Items
A GPU-Accelerated Hybrid Method for a Class of Multi-Depot Vehicle Routing Problems
by: Lei, Zhenyu, et al.
Published: (2026)
Accelerating Large Language Model Training with Hybrid GPU-based Compression
by: Xu, Lang, et al.
Published: (2024)
Accelerating Long-Tail Generation in Synchronous RLHF Training via Adaptive Tensor Parallelism
by: Zhao, Long, et al.
Published: (2026)
Towards Scalable GPU-Accelerated SNN Training via Temporal Fusion
by: Li, Yanchen, et al.
Published: (2024)
FedPAW: Federated Learning with Personalized Aggregation Weights for Urban Vehicle Speed Prediction
by: He, Yuepeng, et al.
Published: (2024)
HadaCore: Tensor Core Accelerated Hadamard Transform Kernel
by: Agarwal, Krish, et al.
Published: (2024)
MSCCL++: Rethinking GPU Communication Abstractions for AI Inference
by: Hwang, Changho, et al.
Published: (2025)
Accelerated Digital Twin Learning for Edge AI: A Comparison of FPGA and Mobile GPU
by: Xu, Bin, et al.
Published: (2025)
ProbSelect: Stochastic Client Selection for GPU-Accelerated Compute Devices in the 3D Continuum
by: Stanisic, Andrija, et al.
Published: (2025)
SwizzlePerf: Hardware-Aware LLMs for GPU Kernel Performance Optimization
by: Tschand, Arya, et al.
Published: (2025)
Xe-Forge: Multi-Stage LLM-Powered Kernel Optimization for Intel GPU
by: Spoczynski, Marcin, et al.
Published: (2026)
UCCL-Zip: Lossless Compression Supercharged GPU Communication
by: Ma, Shuang, et al.
Published: (2026)
GPU-Virt-Bench: A Comprehensive Benchmarking Framework for Software-Based GPU Virtualization Systems
by: VG, Jithin, et al.
Published: (2025)
A Survey on Large Language Model Acceleration based on KV Cache Management
by: Li, Haoyang, et al.
Published: (2024)
Speeding up Policy Simulation in Supply Chain RL
by: Farias, Vivek, et al.
Published: (2024)
Accelerating Sparse MTTKRP for Small Tensor Decomposition on GPU
by: Wijeratne, Sasindu, et al.
Published: (2025)
A Scheduling Framework for Efficient MoE Inference on Edge GPU-NDP Systems
by: Wu, Qi, et al.
Published: (2026)
TACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training
by: Liu, Man, et al.
Published: (2026)
Power- and Fragmentation-aware Online Scheduling for GPU Datacenters
by: Lettich, Francesco, et al.
Published: (2024)
Beyond the GPU: The Strategic Role of FPGAs in the Next Wave of AI
by: Jiménez, Arturo Urías
Published: (2025)
Accelerating Drug Discovery in AutoDock-GPU with Tensor Cores
by: Schieffer, Gabin, et al.
Published: (2024)
STAGE: A Symbolic Tensor grAph GEnerator for distributed AI system co-design
by: Man, Changhai, et al.
Published: (2025)
Tangram: Accelerating Serverless LLM Loading through GPU Memory Reuse and Affinity
by: Zhu, Wenbin, et al.
Published: (2025)
OrchMLLM: Orchestrate Multimodal Data with Batch Post-Balancing to Accelerate Multimodal Large Language Model Training
by: Zheng, Yijie, et al.
Published: (2025)
An Efficient Heterogeneous Co-Design for Fine-Tuning on a Single GPU
by: Yang, Ruijia, et al.
Published: (2026)
Accurate GPU Memory Prediction for Deep Learning Jobs through Dynamic Analysis
by: Shi, Jiabo, et al.
Published: (2025)
Reducing Fragmentation and Starvation in GPU Clusters through Dynamic Multi-Objective Scheduling
by: Mamirov, Akhmadillo
Published: (2025)
D$^{2}$MoE: Dual Routing and Dynamic Scheduling for Efficient On-Device MoE-based LLM Serving
by: Wang, Haodong, et al.
Published: (2025)
Dynamic Pricing for Electric Vehicle Charging
by: Kalakanti, Arun Kumar, et al.
Published: (2024)
Distributed LLM Pretraining During Renewable Curtailment Windows: A Feasibility Study
by: Wiesner, Philipp, et al.
Published: (2026)
Latency-Aware 2-Opt Monotonic Local Search for Distributed Constraint Optimization
by: Rachmut, Ben, et al.
Published: (2025)
Thousand-GPU Large-Scale Training and Optimization Recipe for AI-Native Cloud Embodied Intelligence Infrastructure
by: Guo, Yongjian, et al.
Published: (2026)
PolyKAN: Efficient Fused GPU Operators for Polynomial Kolmogorov-Arnold Network Variants
by: Yu, Mingkun, et al.
Published: (2025)
FairKV: Balancing Per-Head KV Cache for Fast Multi-GPU Inference
by: Zhao, Bingzhe, et al.
Published: (2025)
Practical offloading for fine-tuning LLM on commodity GPU via learned sparse projectors
by: Chen, Siyuan, et al.
Published: (2024)
Watt Counts: Energy-Aware Benchmark for Sustainable LLM Inference on Heterogeneous GPU Architectures
by: Argerich, Mauricio Fadel, et al.
Published: (2026)
Act While Thinking: Accelerating LLM Agents via Pattern-Aware Speculative Tool Execution
by: Sui, Yifan, et al.
Published: (2026)
ARGUS: Agentic GPU Optimization Guided by Data-Flow Invariants
by: Mai, Haohui, et al.
Published: (2026)
A Parallel CPU-GPU Framework for Batching Heuristic Operations in Depth-First Heuristic Search
by: Futuhi, Ehsan, et al.
Published: (2025)
Distributed Low-Communication Training with Decoupled Momentum Optimization
by: Nedelkoski, Sasho, et al.
Published: (2025)