| Main Authors: | Stubbs, Joe, Padhy, Smruti, Cardone, Richard |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2408.03349 |
Similar Items
GPU Cluster Scheduling for Network-Sensitive Deep Learning
by: Sharma, Aakash, et al.
Published: (2024)
Is Intelligence the Right Direction in New OS Scheduling for Multiple Resources in Cloud Environments?
by: Dou, Xinglei, et al.
Published: (2025)
Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms
by: Lin, Zhongyi, et al.
Published: (2024)
Prompt-Aware Scheduling for Low-Latency LLM Serving
by: Tao, Yiheng, et al.
Published: (2025)
Agentic Auto-Scheduling: An Experimental Study of LLM-Guided Loop Optimization
by: Merouani, Massinissa, et al.
Published: (2025)
A Comparative Study of OpenMP Scheduling Algorithm Selection Strategies
by: Korndörfer, Jonas H. Müller, et al.
Published: (2025)
KVDirect: Distributed Disaggregated LLM Inference
by: Chen, Shiyang, et al.
Published: (2024)
KVPR: Efficient LLM Inference with I/O-Aware KV Cache Partial Recomputation
by: Jiang, Chaoyi, et al.
Published: (2024)
When Less is More: Achieving Faster Convergence in Distributed Edge Machine Learning
by: Basani, Advik Raj, et al.
Published: (2024)
iSpLib: A Library for Accelerating Graph Neural Networks using Auto-tuned Sparse Operations
by: Anik, Md Saidul Hoque, et al.
Published: (2024)
cedar: Optimized and Unified Machine Learning Input Data Pipelines
by: Zhao, Mark, et al.
Published: (2024)
MIREncoder: Multi-modal IR-based Pretrained Embeddings for Performance Optimizations
by: Dutta, Akash, et al.
Published: (2024)
AutoChunk: Automated Activation Chunk for Memory-Efficient Long Sequence Inference
by: Zhao, Xuanlei, et al.
Published: (2024)
MassiveGNN: Efficient Training via Prefetching for Massively Connected Distributed Graphs
by: Sarkar, Aishwarya, et al.
Published: (2024)
Less is More: Optimizing Function Calling for LLM Execution on Edge Devices
by: Paramanayakam, Varatheepan, et al.
Published: (2024)
BestServe: Serving Strategies with Optimal Goodput in Collocation and Disaggregation Architectures
by: Hu, Xiannan, et al.
Published: (2025)
Fake Runs, Real Fixes -- Analyzing xPU Performance Through Simulation
by: Zarkadas, Ioannis, et al.
Published: (2025)
InkStream: Real-time GNN Inference on Streaming Graphs via Incremental Update
by: Wu, Dan, et al.
Published: (2023)
Vectorized FlashAttention with Low-cost Exponential Computation in RISC-V Vector Processors
by: Titopoulos, Vasileios, et al.
Published: (2025)
TaxBreak: Unmasking the Hidden Costs of LLM Inference Through Overhead Decomposition
by: Vellaisamy, Prabhu, et al.
Published: (2026)
DeepCQ: General-Purpose Deep-Surrogate Framework for Lossy Compression Quality Prediction
by: Mumenin, Khondoker Mirazul, et al.
Published: (2025)
xMem: A CPU-Based Approach for Accurate Estimation of GPU Memory in Deep Learning Training Workloads
by: Shi, Jiabo, et al.
Published: (2025)
IPA: Inference Pipeline Adaptation to Achieve High Accuracy and Cost-Efficiency
by: Ghafouri, Saeid, et al.
Published: (2023)
CoFormer: Collaborating with Heterogeneous Edge Devices for Scalable Transformer Inference
by: Xu, Guanyu, et al.
Published: (2025)
Accelerating Mobile Inference through Fine-Grained CPU-GPU Co-Execution
by: Li, Zhuojin, et al.
Published: (2025)
Multi-DNN Inference of Sparse Models on Edge SoCs
by: Luo, Jiawei, et al.
Published: (2026)
MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces
by: Sridharan, Srinivas, et al.
Published: (2026)
AutoSP: Unlocking Long-Context LLM Training Via Compiler-Based Sequence Parallelism
by: Gupta, Ahan, et al.
Published: (2026)
ReLATE: Learning Efficient Sparse Encoding for High-Performance Tensor Decomposition
by: Helal, Ahmed E., et al.
Published: (2025)
Execution time budget assignment for mixed criticality systems
by: Khelassi, Mohamed Amine, et al.
Published: (2023)
Ecomap: Sustainability-Driven Optimization of Multi-Tenant DNN Execution on Edge Servers
by: Paramanayakam, Varatheepan, et al.
Published: (2025)
Distributed Matrix-Based Sampling for Graph Neural Network Training
by: Tripathy, Alok, et al.
Published: (2023)
CloudFormer: An Attention-based Performance Prediction for Public Clouds with Unknown Workload
by: Shahbazinia, Amirhossein, et al.
Published: (2025)
Glinthawk: A Two-Tiered Architecture for Offline LLM Inference
by: Hamadanian, Pouya, et al.
Published: (2025)
CARMA: Collocation-Aware Resource Manager
by: Yousefzadeh-Asl-Miandoab, Ehsan, et al.
Published: (2025)
Ariel-ML: Computing Parallelization with Embedded Rust for Neural Networks on Heterogeneous Multi-core Microcontrollers
by: Huang, Zhaolan, et al.
Published: (2025)
You Don't Need All Attentions: Distributed Dynamic Fine-Tuning for Foundation Models
by: Ding, Shiwei, et al.
Published: (2025)
Tuning the Tuner: Introducing Hyperparameter Optimization for Auto-Tuning
by: Willemsen, Floris-Jan, et al.
Published: (2025)
A Practical Two-Stage Framework for GPU Resource and Power Prediction in Heterogeneous HPC Systems
by: Oztop, Beste, et al.
Published: (2026)
Characterizing WebGPU Dispatch Overhead for LLM Inference Across Four GPU Vendors, Three Backends, and Three Browsers
by: Maczan, Jędrzej
Published: (2026)