| Main Authors: | Kamath, Aditya K, Krishnamurthy, Arvind, Canini, Marco, Peter, Simon |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.30728 |
Similar Items
POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference
by: Kamath, Aditya K, et al.
Published: (2024)
Rotary GPU: Exploring Local Execution Paths for Large Mixture-of-Experts Models Under Limited GPU Memory
by: Jo, Myeong Jun
Published: (2026)
Parallelization Strategies for Dense LLM Deployment: Navigating Through Application-Specific Tradeoffs and Bottlenecks
by: Topcu, Burak, et al.
Published: (2026)
MAS-Attention: Memory-Aware Stream Processing for Attention Acceleration on Resource-Constrained Edge Devices
by: Shakerdargah, Mohammadali, et al.
Published: (2024)
Nanoscaling Floating-Point (NxFP): NanoMantissa, Adaptive Microexponents, and Code Recycling for Direct-Cast Compression of Large Language Models
by: Lo, Yun-Chen, et al.
Published: (2024)
Predictive Multi-Tier Memory Management for KV Cache in Large-Scale GPU Inference
by: Ganjihal, Sanjeev Rao
Published: (2026)
ATTNChecker: Highly-Optimized Fault Tolerant Attention for Large Language Model Training
by: Liang, Yuhang, et al.
Published: (2024)
Comparative Analysis of Large Language Model Inference Serving Systems: A Performance Study of vLLM and HuggingFace TGI
by: Kolluru, Saicharan
Published: (2025)
Training LLMs on HPC Systems: Best Practices from the OpenGPT-X Project
by: Penke, Carolin, et al.
Published: (2025)
Kant: An Efficient Unified Scheduling System for Large-Scale AI Clusters
by: Zeng, Lingling, et al.
Published: (2025)
Architecture-Aware LLM Inference Optimization on AMD Instinct GPUs: A Comprehensive Benchmark and Deployment Study
by: Georgiou, Athos
Published: (2026)
Parameter-Efficient and Personalized Federated Training of Generative Models at the Edge
by: Khan, Kabir, et al.
Published: (2025)
Towards Building Private LLMs: Exploring Multi-Node Expert Parallelism on Apple Silicon for Mixture-of-Experts Large Language Model
by: Chen, Mu-Chi, et al.
Published: (2025)
GraphBit: A Graph-based Agentic Framework for Non-Linear Agent Orchestration
by: Sarker, Yeahia, et al.
Published: (2026)
SparkAttention: High-Performance Multi-Head Attention for Large Models on Volta GPU Architecture
by: Xu, Youxuan, et al.
Published: (2025)
Addressing tokens dynamic generation, propagation, storage and renewal to secure the GlideinWMS pilot based jobs and system
by: Coimbra, Bruno Moreira, et al.
Published: (2025)
DSDE: Dynamic Speculative Decoding with KLD Stability for Real-World Serving
by: Yang, Mingyu, et al.
Published: (2025)
Scalable Engine and the Performance of Different LLM Models in a SLURM based HPC architecture
by: Luiz, Anderson de Lima, et al.
Published: (2025)
Flex-MIG: Enabling Distributed Execution on MIG
by: Kim, Myeongsu, et al.
Published: (2025)
GREEN-CODE: Learning to Optimize Energy Efficiency in LLM-based Code Generation
by: Ilager, Shashikant, et al.
Published: (2025)
Combining Serverless and High-Performance Computing Paradigms to support ML Data-Intensive Applications
by: Staylor, Mills, et al.
Published: (2025)
ConfigSpec: Profiling-Based Configuration Selection for Distributed Edge-Cloud Speculative LLM Serving
by: Li, Xiangchen, et al.
Published: (2026)
WISP: Waste- and Interference-Suppressed Distributed Speculative LLM Serving at the Edge via Dynamic Drafting and SLO-Aware Batching
by: Li, Xiangchen, et al.
Published: (2026)
Serving LLMs in HPC Clusters: A Comparative Study of Qualcomm Cloud AI 100 Ultra and NVIDIA Data Center GPUs
by: Sada, Mohammad Firas, et al.
Published: (2025)
StepCache: Step-Level Reuse with Lightweight Verification and Selective Patching for LLM Serving
by: Nouri, Azam
Published: (2026)
Libra: Unleashing GPU Heterogeneity for High-Performance Sparse Matrix Multiplication
by: Shi, Jinliang, et al.
Published: (2025)
AutoDDL: Automatic Distributed Deep Learning with Near-Optimal Bandwidth Cost
by: Chen, Jinfan, et al.
Published: (2023)
Evaluating Large Language Models for Workload Mapping and Scheduling in Heterogeneous HPC Systems
by: Sharma, Aasish Kumar, et al.
Published: (2025)
ZenFlow: Enabling Stall-Free Offloading Training via Asynchronous Updates
by: Lan, Tingfeng, et al.
Published: (2025)
FlashSpread: IO-Aware GPU Simulation of Non-Markovian Epidemic Dynamics via Kernel Fusion
by: Shakeri, Heman, et al.
Published: (2026)
Spark-LLM-Eval: A Distributed Framework for Statistically Rigorous Large Language Model Evaluation
by: Mitra, Subhadip
Published: (2026)
Scalability Evaluation of HPC Multi-GPU Training for ECG-based LLMs
by: Mileski, Dimitar, et al.
Published: (2025)
CRDT-Based Game State Synchronization in Peer-to-Peer VR
by: Dantas, Abel, et al.
Published: (2025)
Hive: A Multi-Agent Infrastructure for Algorithm- and Task-Level Scaling
by: Luo, Zizhang, et al.
Published: (2026)
Benchmarking Federated Learning for Throughput Prediction in 5G Live Streaming Applications
by: Dutta, Yuvraj, et al.
Published: (2025)
Accelerating Causal Algorithms for Industrial-scale Data: A Distributed Computing Approach with Ray Framework
by: Verma, Vishal, et al.
Published: (2024)
De-DSI: Decentralised Differentiable Search Index
by: Neague, Petru, et al.
Published: (2024)
Towards Message Brokers for Generative AI: Survey, Challenges, and Opportunities
by: Saleh, Alaa, et al.
Published: (2023)
Flash-Fusion: Enabling Expressive, Low-Latency Queries on IoT Sensor Streams with LLMs
by: Patherya, Kausar, et al.
Published: (2025)
FedMon: Federated eBPF Monitoring for Distributed Anomaly Detection in Multi-Cluster Cloud Environments
by: Zehra, Sehar, et al.
Published: (2025)