Library Catalog

Bibliographic Details
Main Authors:	Ji, Cheng; Luo, Huaiying
Format:	Preprint
Published:	2025
Subjects:	Distributed, Parallel, and Cluster Computing; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2505.11743

Similar Items

Leveraging Large Language Model for Intelligent Log Processing and Autonomous Debugging in Cloud AI Platforms
by: Ji, Cheng, et al.
Published: (2025)

Adaptive Fault Tolerance Mechanisms of Large Language Models in Cloud Computing Environments
by: Jin, Yihong, et al.
Published: (2025)

Cloud Atlas: Efficient Fault Localization for Cloud Systems using Language Models and Causal Insight
by: Xie, Zhiqiang, et al.
Published: (2024)

Intelligent Autonomous Orchestration for Distributed Cloud Resources using Complex-Stability Analysis
by: Shyam, Gopal Krishna, et al.
Published: (2026)

The AI_INFN Platform: Artificial Intelligence Development in the Cloud
by: Anderlini, Lucio, et al.
Published: (2025)

TierCheck: Tiered Checkpointing for Fault Tolerance in Large Language Model Training
by: Han, Shujie, et al.
Published: (2026)

Thousand-GPU Large-Scale Training and Optimization Recipe for AI-Native Cloud Embodied Intelligence Infrastructure
by: Guo, Yongjian, et al.
Published: (2026)

AI4EOSC: a Federated Cloud Platform for Artificial Intelligence in Scientific Research
by: Heredia, Ignacio, et al.
Published: (2025)

Efficient Multi-Model Orchestration for Self-Hosted Large Language Models
by: Vangala, Bhanu Prakash, et al.
Published: (2025)

Research on Model Parallelism and Data Parallelism Optimization Methods in Large Language Model-Based Recommendation Systems
by: Yang, Haowei, et al.
Published: (2025)

EDiT: A Local-SGD-Based Efficient Distributed Training Method for Large Language Models
by: Cheng, Jialiang, et al.
Published: (2024)

Connecting Large Language Models with Blockchain: Advancing the Evolution of Smart Contracts from Automation to Intelligence
by: Xian, Youquan, et al.
Published: (2024)

HPRM: High-Performance Robotic Middleware for Intelligent Autonomous Systems
by: Kwok, Jacky, et al.
Published: (2024)

Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence
by: Chen, Xinquan, et al.
Published: (2026)

Scalable Cloud-Native Architectures for Intelligent PMU Data Processing
by: Chockalingam, Nachiappan, et al.
Published: (2025)

AI Inference as Relocatable Electricity Demand: A Latency-Constrained Energy-Geography Framework
by: Luo, Xubin, et al.
Published: (2026)

Autonomous Systems Dependability in the era of AI: Design Challenges in Safety, Security, Reliability and Certification
by: Ranjbar, Behnaz, et al.
Published: (2026)

A Self-Healing and Fault-Tolerant Cloud-based Digital Twin Processing Management Model
by: Saxena, Deepika, et al.
Published: (2025)

SkyServe: Serving AI Models across Regions and Clouds with Spot Instances
by: Mao, Ziming, et al.
Published: (2024)

Building AI Agents for Autonomous Clouds: Challenges and Design Principles
by: Shetty, Manish, et al.
Published: (2024)

Enhancing Cluster Resilience: LLM-agent Based Autonomous Intelligent Cluster Diagnosis System and Evaluation Framework
by: Shi, Honghao, et al.
Published: (2024)

Intelligent Load Balancing in Cloud Computer Systems
by: Sliwko, Leszek
Published: (2025)

AI Factories: It's time to rethink the Cloud-HPC divide
by: Lopez, Pedro Garcia, et al.
Published: (2025)

Hardware Utilization and Inference Performance of Edge Object Detection Under Fault Injection
by: Pasandideh, Faezeh, et al.
Published: (2026)

Role-Based Fault Tolerance System for LLM RL Post-Training
by: Chen, Zhenqian, et al.
Published: (2025)

AI-Driven Cloud Resource Optimization for Multi-Cluster Environments
by: Punniyamoorthy, Vinoth, et al.
Published: (2025)

A Survey on Failure Analysis and Fault Injection in AI Systems
by: Yu, Guangba, et al.
Published: (2024)

Training Overhead Ratio: A Practical Reliability Metric for Large Language Model Training Systems
by: Lu, Ning, et al.
Published: (2024)

Adaptive AI-based Decentralized Resource Management in the Cloud-Edge Continuum
by: Li, Lanpei, et al.
Published: (2025)

ClusterRCA: An End-to-End Approach for Network Fault Localization and Classification for HPC System
by: Sun, Yongqian, et al.
Published: (2025)

Why Do AI Agents Systematically Fail at Cloud Root Cause Analysis?
by: Kim, Taeyoon, et al.
Published: (2026)

A Meta-Heuristic Load Balancer for Cloud Computing Systems
by: Sliwko, Leszek, et al.
Published: (2025)

Scaling Performance of Large Language Model Pretraining
by: Interrante-Grant, Alexander, et al.
Published: (2025)

Keep Your Friends Close: Leveraging Affinity Groups to Accelerate AI Inference Workflows
by: Garrett, Thiago, et al.
Published: (2023)

Ensemble Method for System Failure Detection Using Large-Scale Telemetry Data
by: Mudgal, Priyanka, et al.
Published: (2024)

ECCENTRIC: Edge-Cloud Collaboration Framework for Distributed Inference Using Knowledge Adaptation
by: Kamani, Mohammad Mahdi, et al.
Published: (2025)

The (R)evolution of Scientific Workflows in the Agentic AI Era: Towards Autonomous Science
by: Shin, Woong, et al.
Published: (2025)

Towards using Reinforcement Learning for Scaling and Data Replication in Cloud Systems
by: Mokadem, Riad, et al.
Published: (2024)

Hierarchical Autoscaling for Large Language Model Serving with Chiron
by: Patke, Archit, et al.
Published: (2025)

Can Large Language Models Write Parallel Code?
by: Nichols, Daniel, et al.
Published: (2024)