| Main Authors: | Qian, Shangshu, Tan, Lin, Zhang, Yongle |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2509.26529 |
Similar Items
A Survey on Failure Analysis and Fault Injection in AI Systems
by: Yu, Guangba, et al.
Published: (2024)
L4: Diagnosing Large-scale LLM Training Failures via Automated Log Analysis
by: Jiang, Zhihan, et al.
Published: (2025)
A Comprehensive Benchmarking Analysis of Fault Recovery in Stream Processing Frameworks
by: Vogel, Adriano, et al.
Published: (2024)
High-level Stream Processing: A Complementary Analysis of Fault Recovery
by: Vogel, Adriano, et al.
Published: (2024)
Learning Recovery Strategies for Dynamic Self-healing in Reactive Systems
by: Sanabria, Mateo, et al.
Published: (2024)
Root Cause Analysis for Microservice Systems via Cascaded Conditional Learning with Hypergraphs
by: Xie, Shuaiyu, et al.
Published: (2025)
UPC Sentinel: An Accurate Approach for Detecting Upgradeability Proxy Contracts in Ethereum
by: Ebrahimi, Amir M., et al.
Published: (2024)
Multi-Objective Load Balancing for Heterogeneous Edge-Based Object Detection Systems
by: Alqahtani, Daghash K., et al.
Published: (2026)
MPI Errors Detection using GNN Embedding and Vector Embedding over LLVM IR
by: Karchi, Jad El, et al.
Published: (2024)
MegaFlow: Large-Scale Distributed Orchestration System for the Agentic Era
by: Zhang, Lei, et al.
Published: (2026)
Domain Adaptation-based Edge Computing for Cross-Conditions Fault Diagnosis
by: Wang, Yanzhi, et al.
Published: (2024)
A Reference Architecture for Governance of Cloud Native Applications
by: Pourmajidi, William, et al.
Published: (2023)
Wherefore Art Thou? Provenance-Guided Automatic Online Debugging with Lumos
by: Chen, Jingyuan, et al.
Published: (2026)
A Framework for Effective Invocation Methods of Various LLM Services
by: Wang, Can, et al.
Published: (2024)
Self-adaptive Multi-Access Edge Architectures: A Robotics Case
by: Moghaddam, Mahyar T, et al.
Published: (2026)
Supercharging Federated Learning with Flower and NVIDIA FLARE
by: Roth, Holger R., et al.
Published: (2024)
Why Does the LLM Stop Computing: An Empirical Study of User-Reported Failures in Open-Source LLMs
by: Yu, Guangba, et al.
Published: (2026)
Causal AI-based Root Cause Identification: Research to Practice at Scale
by: Jha, Saurabh, et al.
Published: (2025)
LLM-HPC++: Evaluating LLM-Generated Modern C++ and MPI+OpenMP Codes for Scalable Mandelbrot Set Computation
by: Diehl, Patrick, et al.
Published: (2025)
A Unifying Framework to Enable Artificial Intelligence in High Performance Computing Workflows
by: Domke, Jens, et al.
Published: (2025)
$μ$OpTime: Statically Reducing the Execution Time of Microbenchmark Suites Using Stability Metrics
by: Japke, Nils, et al.
Published: (2025)
AdaptiFlow: An Extensible Framework for Event-Driven Autonomy in Cloud Microservices
by: Ndadji, Brice Arléon Zemtsop, et al.
Published: (2025)
Do Large Language Models Understand Performance Optimization?
by: Cui, Bowen, et al.
Published: (2025)
Umbilical Choir: Automated Live Testing for Edge-To-Cloud FaaS Applications
by: Malekabbasi, Mohammadreza, et al.
Published: (2025)
Container-level Energy Observability in Kubernetes Clusters
by: Pijnacker, Bjorn, et al.
Published: (2025)
FlowUnits: Extending Dataflow for the Edge-to-Cloud Computing Continuum
by: Chini, Fabio, et al.
Published: (2025)
SoK: Microservice Architectures from a Dependability Perspective
by: Kažemaks, Dāvis, et al.
Published: (2025)
An Analysis of HPC and Edge Architectures in the Cloud
by: Santillan, Steven, et al.
Published: (2025)
LLM4FaaS: No-Code Application Development using LLMs and FaaS
by: Wang, Minghe, et al.
Published: (2025)
Investigating Matrix Repartitioning to Address the Over- and Undersubscription Challenge for a GPU-based CFD Solver
by: Olenik, Gregor, et al.
Published: (2025)
Radon: a Programming Model and Platform for Computing Continuum Systems
by: De Martini, Luca, et al.
Published: (2025)
A Large-Scale Exploratory Study on the Proxy Pattern in Ethereum
by: Ebrahimi, Amir M., et al.
Published: (2025)
Supporting Long-term Transactions in Smart Contracts Generated from Business Process Model and Notation (BPMN) Models
by: Liu, Christian Gang
Published: (2025)
FMI Meets SystemC: A Framework for Cross-Tool Virtual Prototyping
by: Bosbach, Nils, et al.
Published: (2025)
FSM Modeling For Off-Blockchain Computation
by: Liu, Christian Gang
Published: (2025)
Addressing Reproducibility Challenges in HPC with Continuous Integration
by: Hayot-Sasson, Valérie, et al.
Published: (2025)
Investigating the Impact of Isolation on Synchronized Benchmarks
by: Japke, Nils, et al.
Published: (2025)
An SLO Driven and Cost-Aware Autoscaling Framework for Kubernetes
by: Punniyamoorthy, Vinoth, et al.
Published: (2025)
Complexity at Scale: A Quantitative Analysis of an Alibaba Microservice Deployment
by: Winchester, Giles, et al.
Published: (2025)
Proceedings First Workshop on Adaptable Cloud Architectures
by: De Palma, Giuseppe, et al.
Published: (2025)