:: Library Catalog

Bibliographic Details
Main Authors:	Kamath, Aditya K; Krishnamurthy, Arvind; Canini, Marco; Peter, Simon
Format:	Preprint
Published:	2026
Subjects:	Machine Learning; Distributed, Parallel, and Cluster Computing; I.2.7; C.1.4; E.4
Online Access:	https://arxiv.org/abs/2605.30728

Similar Items

POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference
by: Kamath, Aditya K, et al.
Published: (2024)

Rotary GPU: Exploring Local Execution Paths for Large Mixture-of-Experts Models Under Limited GPU Memory
by: Jo, Myeong Jun
Published: (2026)

Parallelization Strategies for Dense LLM Deployment: Navigating Through Application-Specific Tradeoffs and Bottlenecks
by: Topcu, Burak, et al.
Published: (2026)

MAS-Attention: Memory-Aware Stream Processing for Attention Acceleration on Resource-Constrained Edge Devices
by: Shakerdargah, Mohammadali, et al.
Published: (2024)

Nanoscaling Floating-Point (NxFP): NanoMantissa, Adaptive Microexponents, and Code Recycling for Direct-Cast Compression of Large Language Models
by: Lo, Yun-Chen, et al.
Published: (2024)

Predictive Multi-Tier Memory Management for KV Cache in Large-Scale GPU Inference
by: Ganjihal, Sanjeev Rao
Published: (2026)

ATTNChecker: Highly-Optimized Fault Tolerant Attention for Large Language Model Training
by: Liang, Yuhang, et al.
Published: (2024)

Comparative Analysis of Large Language Model Inference Serving Systems: A Performance Study of vLLM and HuggingFace TGI
by: Kolluru, Saicharan
Published: (2025)

Training LLMs on HPC Systems: Best Practices from the OpenGPT-X Project
by: Penke, Carolin, et al.
Published: (2025)

Kant: An Efficient Unified Scheduling System for Large-Scale AI Clusters
by: Zeng, Lingling, et al.
Published: (2025)

Architecture-Aware LLM Inference Optimization on AMD Instinct GPUs: A Comprehensive Benchmark and Deployment Study
by: Georgiou, Athos
Published: (2026)

Parameter-Efficient and Personalized Federated Training of Generative Models at the Edge
by: Khan, Kabir, et al.
Published: (2025)

Towards Building Private LLMs: Exploring Multi-Node Expert Parallelism on Apple Silicon for Mixture-of-Experts Large Language Model
by: Chen, Mu-Chi, et al.
Published: (2025)

GraphBit: A Graph-based Agentic Framework for Non-Linear Agent Orchestration
by: Sarker, Yeahia, et al.
Published: (2026)

SparkAttention: High-Performance Multi-Head Attention for Large Models on Volta GPU Architecture
by: Xu, Youxuan, et al.
Published: (2025)

Addressing tokens dynamic generation, propagation, storage and renewal to secure the GlideinWMS pilot based jobs and system
by: Coimbra, Bruno Moreira, et al.
Published: (2025)

DSDE: Dynamic Speculative Decoding with KLD Stability for Real-World Serving
by: Yang, Mingyu, et al.
Published: (2025)

Scalable Engine and the Performance of Different LLM Models in a SLURM based HPC architecture
by: Luiz, Anderson de Lima, et al.
Published: (2025)

Flex-MIG: Enabling Distributed Execution on MIG
by: Kim, Myeongsu, et al.
Published: (2025)

GREEN-CODE: Learning to Optimize Energy Efficiency in LLM-based Code Generation
by: Ilager, Shashikant, et al.
Published: (2025)

Combining Serverless and High-Performance Computing Paradigms to support ML Data-Intensive Applications
by: Staylor, Mills, et al.
Published: (2025)

ConfigSpec: Profiling-Based Configuration Selection for Distributed Edge-Cloud Speculative LLM Serving
by: Li, Xiangchen, et al.
Published: (2026)

WISP: Waste- and Interference-Suppressed Distributed Speculative LLM Serving at the Edge via Dynamic Drafting and SLO-Aware Batching
by: Li, Xiangchen, et al.
Published: (2026)

Serving LLMs in HPC Clusters: A Comparative Study of Qualcomm Cloud AI 100 Ultra and NVIDIA Data Center GPUs
by: Sada, Mohammad Firas, et al.
Published: (2025)

StepCache: Step-Level Reuse with Lightweight Verification and Selective Patching for LLM Serving
by: Nouri, Azam
Published: (2026)

Libra: Unleashing GPU Heterogeneity for High-Performance Sparse Matrix Multiplication
by: Shi, Jinliang, et al.
Published: (2025)

AutoDDL: Automatic Distributed Deep Learning with Near-Optimal Bandwidth Cost
by: Chen, Jinfan, et al.
Published: (2023)

Evaluating Large Language Models for Workload Mapping and Scheduling in Heterogeneous HPC Systems
by: Sharma, Aasish Kumar, et al.
Published: (2025)

ZenFlow: Enabling Stall-Free Offloading Training via Asynchronous Updates
by: Lan, Tingfeng, et al.
Published: (2025)

FlashSpread: IO-Aware GPU Simulation of Non-Markovian Epidemic Dynamics via Kernel Fusion
by: Shakeri, Heman, et al.
Published: (2026)

Spark-LLM-Eval: A Distributed Framework for Statistically Rigorous Large Language Model Evaluation
by: Mitra, Subhadip
Published: (2026)

Scalability Evaluation of HPC Multi-GPU Training for ECG-based LLMs
by: Mileski, Dimitar, et al.
Published: (2025)

CRDT-Based Game State Synchronization in Peer-to-Peer VR
by: Dantas, Abel, et al.
Published: (2025)

Hive: A Multi-Agent Infrastructure for Algorithm- and Task-Level Scaling
by: Luo, Zizhang, et al.
Published: (2026)

Benchmarking Federated Learning for Throughput Prediction in 5G Live Streaming Applications
by: Dutta, Yuvraj, et al.
Published: (2025)

Accelerating Causal Algorithms for Industrial-scale Data: A Distributed Computing Approach with Ray Framework
by: Verma, Vishal, et al.
Published: (2024)

De-DSI: Decentralised Differentiable Search Index
by: Neague, Petru, et al.
Published: (2024)

Towards Message Brokers for Generative AI: Survey, Challenges, and Opportunities
by: Saleh, Alaa, et al.
Published: (2023)

Flash-Fusion: Enabling Expressive, Low-Latency Queries on IoT Sensor Streams with LLMs
by: Patherya, Kausar, et al.
Published: (2025)

FedMon: Federated eBPF Monitoring for Distributed Anomaly Detection in Multi-Cluster Cloud Environments
by: Zehra, Sehar, et al.
Published: (2025)