| Main Authors: | Ji, Cheng, Luo, Huaiying |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2505.11743 |
Similar Items
Leveraging Large Language Model for Intelligent Log Processing and Autonomous Debugging in Cloud AI Platforms
by: Ji, Cheng, et al.
Published: (2025)
Adaptive Fault Tolerance Mechanisms of Large Language Models in Cloud Computing Environments
by: Jin, Yihong, et al.
Published: (2025)
Cloud Atlas: Efficient Fault Localization for Cloud Systems using Language Models and Causal Insight
by: Xie, Zhiqiang, et al.
Published: (2024)
Intelligent Autonomous Orchestration for Distributed Cloud Resources using Complex-Stability Analysis
by: Shyam, Gopal Krishna, et al.
Published: (2026)
The AI_INFN Platform: Artificial Intelligence Development in the Cloud
by: Anderlini, Lucio, et al.
Published: (2025)
TierCheck: Tiered Checkpointing for Fault Tolerance in Large Language Model Training
by: Han, Shujie, et al.
Published: (2026)
Thousand-GPU Large-Scale Training and Optimization Recipe for AI-Native Cloud Embodied Intelligence Infrastructure
by: Guo, Yongjian, et al.
Published: (2026)
AI4EOSC: a Federated Cloud Platform for Artificial Intelligence in Scientific Research
by: Heredia, Ignacio, et al.
Published: (2025)
Efficient Multi-Model Orchestration for Self-Hosted Large Language Models
by: Vangala, Bhanu Prakash, et al.
Published: (2025)
Research on Model Parallelism and Data Parallelism Optimization Methods in Large Language Model-Based Recommendation Systems
by: Yang, Haowei, et al.
Published: (2025)
EDiT: A Local-SGD-Based Efficient Distributed Training Method for Large Language Models
by: Cheng, Jialiang, et al.
Published: (2024)
Connecting Large Language Models with Blockchain: Advancing the Evolution of Smart Contracts from Automation to Intelligence
by: Xian, Youquan, et al.
Published: (2024)
HPRM: High-Performance Robotic Middleware for Intelligent Autonomous Systems
by: Kwok, Jacky, et al.
Published: (2024)
Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence
by: Chen, Xinquan, et al.
Published: (2026)
Scalable Cloud-Native Architectures for Intelligent PMU Data Processing
by: Chockalingam, Nachiappan, et al.
Published: (2025)
AI Inference as Relocatable Electricity Demand: A Latency-Constrained Energy-Geography Framework
by: Luo, Xubin, et al.
Published: (2026)
Autonomous Systems Dependability in the era of AI: Design Challenges in Safety, Security, Reliability and Certification
by: Ranjbar, Behnaz, et al.
Published: (2026)
A Self-Healing and Fault-Tolerant Cloud-based Digital Twin Processing Management Model
by: Saxena, Deepika, et al.
Published: (2025)
SkyServe: Serving AI Models across Regions and Clouds with Spot Instances
by: Mao, Ziming, et al.
Published: (2024)
Building AI Agents for Autonomous Clouds: Challenges and Design Principles
by: Shetty, Manish, et al.
Published: (2024)
Enhancing Cluster Resilience: LLM-agent Based Autonomous Intelligent Cluster Diagnosis System and Evaluation Framework
by: Shi, Honghao, et al.
Published: (2024)
Intelligent Load Balancing in Cloud Computer Systems
by: Sliwko, Leszek
Published: (2025)
AI Factories: It's time to rethink the Cloud-HPC divide
by: Lopez, Pedro Garcia, et al.
Published: (2025)
Hardware Utilization and Inference Performance of Edge Object Detection Under Fault Injection
by: Pasandideh, Faezeh, et al.
Published: (2026)
Role-Based Fault Tolerance System for LLM RL Post-Training
by: Chen, Zhenqian, et al.
Published: (2025)
AI-Driven Cloud Resource Optimization for Multi-Cluster Environments
by: Punniyamoorthy, Vinoth, et al.
Published: (2025)
A Survey on Failure Analysis and Fault Injection in AI Systems
by: Yu, Guangba, et al.
Published: (2024)
Training Overhead Ratio: A Practical Reliability Metric for Large Language Model Training Systems
by: Lu, Ning, et al.
Published: (2024)
Adaptive AI-based Decentralized Resource Management in the Cloud-Edge Continuum
by: Li, Lanpei, et al.
Published: (2025)
ClusterRCA: An End-to-End Approach for Network Fault Localization and Classification for HPC System
by: Sun, Yongqian, et al.
Published: (2025)
Why Do AI Agents Systematically Fail at Cloud Root Cause Analysis?
by: Kim, Taeyoon, et al.
Published: (2026)
A Meta-Heuristic Load Balancer for Cloud Computing Systems
by: Sliwko, Leszek, et al.
Published: (2025)
Scaling Performance of Large Language Model Pretraining
by: Interrante-Grant, Alexander, et al.
Published: (2025)
Keep Your Friends Close: Leveraging Affinity Groups to Accelerate AI Inference Workflows
by: Garrett, Thiago, et al.
Published: (2023)
Ensemble Method for System Failure Detection Using Large-Scale Telemetry Data
by: Mudgal, Priyanka, et al.
Published: (2024)
ECCENTRIC: Edge-Cloud Collaboration Framework for Distributed Inference Using Knowledge Adaptation
by: Kamani, Mohammad Mahdi, et al.
Published: (2025)
The (R)evolution of Scientific Workflows in the Agentic AI Era: Towards Autonomous Science
by: Shin, Woong, et al.
Published: (2025)
Towards using Reinforcement Learning for Scaling and Data Replication in Cloud Systems
by: Mokadem, Riad, et al.
Published: (2024)
Hierarchical Autoscaling for Large Language Model Serving with Chiron
by: Patke, Archit, et al.
Published: (2025)
Can Large Language Models Write Parallel Code?
by: Nichols, Daniel, et al.
Published: (2024)