Library Catalog

Bibliographic Details
Main Authors:	Sun, Yongqian; Pan, Xijie; Xiong, Xiao; Tao, Lei; Wang, Jiaju; Zhang, Shenglin; Yuan, Yuan; Li, Yuqi; Jian, Kunlin
Format:	Preprint
Published:	2025
Subjects:	Distributed, Parallel, and Cluster Computing; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2506.20673

Similar Items

Efficient Fault Localization in a Cloud Stack Using End-to-End Application Service Topology
by: Mathews, Dhanya R, et al.
Published: (2025)

FT-Transformer: Resilient and Reliable Transformer with End-to-End Fault Tolerant Attention
by: Dai, Huangliang, et al.
Published: (2025)

A Proposed End-To-End Principle for Data Commons
by: Grossman, Robert L.
Published: (2025)

End-to-End and Phase-Level Performance Optimization for Hyperledger Fabric
by: Sollu, Pavan, et al.
Published: (2026)

Solutions for Distributed Memory Access Mechanism on HPC Clusters
by: Meizner, Jan, et al.
Published: (2025)

Saarthi: An End-to-End Intelligent Platform for Optimising Distributed Serverless Workloads
by: Agarwal, Siddharth, et al.
Published: (2025)

A Scenario-Oriented Benchmark for Assessing AIOps Algorithms in Microservice Management
by: Sun, Yongqian, et al.
Published: (2024)

Beyond End-to-End: Dynamic Chain Optimization for Private LLM Adaptation on the Edge
by: Wu, Yebo, et al.
Published: (2026)

Leveraging Teaching on Demand: Approaching HPC to Undergrads
by: Catalán, S., et al.
Published: (2026)

Cost-Effective Edge Data Distribution with End-To-End Delay Guarantees in Edge Computing
by: Shankar, Ravi, et al.
Published: (2025)

A Survey of End-to-End Modeling for Distributed DNN Training: Workloads, Simulators, and TCO
by: Svedas, Jonas, et al.
Published: (2025)

Hamava: Fault-tolerant Reconfigurable Geo-Replication on Heterogeneous Clusters
by: Mane, Tejas, et al.
Published: (2024)

Resource Optimization with MPI Process Malleability for Dynamic Workloads in HPC Clusters
by: Iserte, Sergio, et al.
Published: (2025)

KubeIntellect: A Modular LLM-Orchestrated Agent Framework for End-to-End Kubernetes Management
by: Ardebili, Mohsen Seyedkazemi, et al.
Published: (2025)

Introducing JIRIAF: A Virtual Kubelet Integration for Optimizing HPC Resource Provisioning
by: Gyurjyan, Vardan, et al.
Published: (2025)

Evaluating Malleable Job Scheduling in HPC Clusters using Real-World Workloads
by: Zojer, Patrick, et al.
Published: (2026)

Beyond Microservices: Testing Web-Scale RCA Methods on GPU-Driven LLM Workloads
by: Scheinert, Dominik, et al.
Published: (2026)

Improving the End-to-End Efficiency of Offline Inference for Multi-LLM Applications Based on Sampling and Simulation
by: Fang, Jingzhi, et al.
Published: (2025)

Attack Graph Generation on HPC Clusters
by: Li, Ming, et al.
Published: (2025)

Odyssey: An End-to-End System for Pareto-Optimal Serverless Query Processing
by: Jesalpura, Shyam, et al.
Published: (2025)

Understanding Large-Scale HPC System Behavior Through Cluster-Based Visual Analytics
by: Austin, Allison, et al.
Published: (2026)

Integrating and Characterizing HPC Task Runtime Systems for hybrid AI-HPC workloads
by: Merzky, Andre, et al.
Published: (2025)

MeCeFO: Enhancing LLM Training Robustness via Fault-Tolerant Optimization
by: Hu, Rizhen, et al.
Published: (2025)

Predictive-LoRA: A Proactive and Fragmentation-Aware Serverless Inference System for LLMs
by: Ni, Yinan, et al.
Published: (2025)

An Incremental Multi-Level, Multi-Scale Approach to Assessment of Multifidelity HPC Systems
by: Shilpika, Shilpika, et al.
Published: (2025)

Resilient Packet Forwarding: A Reinforcement Learning Approach to Routing in Gaussian Interconnected Networks with Clustered Faults
by: Charrwi, Mohammad Walid, et al.
Published: (2025)

DiT-HC: Enabling Efficient Training of Visual Generation Model DiT on HPC-oriented CPU Cluster
by: Zhang, Jinxiao, et al.
Published: (2026)

LuWu: An End-to-End In-Network Out-of-Core Optimizer for 100B-Scale Model-in-Network Data-Parallel Training on Distributed GPUs
by: Sun, Mo, et al.
Published: (2024)

Deal: Distributed End-to-End GNN Inference for All Nodes
by: Chen, Shiyang, et al.
Published: (2025)

HPC with Enhanced User Separation
by: Prout, Andrew, et al.
Published: (2024)

Analysis of the carbon footprint of HPC
by: Benhari, Abdessalam, et al.
Published: (2025)

Minos: Systematically Classifying Performance and Power Characteristics of GPU Workloads on HPC Clusters
by: Jain, Rutwik, et al.
Published: (2026)

A Heterogeneous Chiplet Architecture for Accelerating End-to-End Transformer Models
by: Sharma, Harsh, et al.
Published: (2023)

On the Convergence of Malleability and the HPC PowerStack: Exploiting Dynamism in Over-Provisioned and Power-Constrained HPC Systems
by: Arima, Eishi, et al.
Published: (2024)

MRSch: Multi-Resource Scheduling for HPC
by: Li, Boyang, et al.
Published: (2024)

Using Malware Detection Techniques for HPC Application Classification
by: Jakobsche, Thomas, et al.
Published: (2024)

nncase: An End-to-End Compiler for Efficient LLM Deployment on Heterogeneous Storage Architectures
by: Guo, Hui, et al.
Published: (2025)

UNR: Unified Notifiable RMA Library for HPC
by: Feng, Guangnan, et al.
Published: (2024)

An Elastic Job Scheduler for HPC Applications on the Cloud
by: Bhosale, Aditya, et al.
Published: (2025)

Sarus Suite: Cloud-native Containers for HPC
by: Madonna, Alberto, et al.
Published: (2026)