| Main Authors: | Hu, Rizhen; He, Yutong; Yan, Ran; Sun, Mou; Yuan, Binghang; Yuan, Kun |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2510.16415 |
Similar Items
Byzantine Fault-Tolerant Min-Max Optimization
by: Liu, Shuo, et al.
Published: (2022)
Approximate Byzantine Fault-Tolerance in Distributed Optimization
by: Liu, Shuo, et al.
Published: (2021)
Chameleon: Adaptive Fault Tolerance for Distributed Training via Real-time Policy Selection
by: Zhou, Yuhang, et al.
Published: (2025)
Optimizing Robot Dispersion on Grids: with and without Fault Tolerance
by: Banerjee, Rik, et al.
Published: (2024)
Unbiased Compression Saves Communication in Distributed Optimization: When and How Much?
by: He, Yutong, et al.
Published: (2023)
Optimizing View Change for Byzantine Fault Tolerance in Parallel Consensus
by: Xie, Yifei, et al.
Published: (2026)
Fault-Tolerant Hybrid-Parallel Training at Scale with Reliable and Efficient In-memory Checkpointing
by: Wang, Yuxin, et al.
Published: (2023)
Beyond Optimal Fault Tolerance
by: Lewis-Pye, Andrew, et al.
Published: (2025)
FTI-TMR: A Fault Tolerance and Isolation Algorithm for Interconnected Multicore Systems
by: Hu, Yiming
Published: (2025)
Byzantine Fault Tolerant Causal Ordering
by: Misra, Anshuman, et al.
Published: (2021)
HexiSeq: Accommodating Long Context Training of LLMs over Heterogeneous Hardware
by: Liang, Yan, et al.
Published: (2026)
Optimal Fault-Tolerant Dispersion on Oriented Grids
by: Banerjee, Rik, et al.
Published: (2024)
Probabilistic Byzantine Fault Tolerance (Extended Version)
by: Avelãs, Diogo, et al.
Published: (2024)
Stabl: Blockchain Fault Tolerance
by: Gramoli, Vincent, et al.
Published: (2024)
HexiScale: Facilitating Large Language Model Training over Heterogeneous Hardware
by: Yan, Ran, et al.
Published: (2024)
VBFT: Veloce Byzantine Fault Tolerant Consensus for Blockchains
by: Jalalzai, Mohammad M., et al.
Published: (2023)
A Fault Tolerance Mechanism for Hybrid Scientific Workflows
by: Mulone, Alberto, et al.
Published: (2024)
Asynchronous Fault-Tolerant Distributed Proper Coloring of Graphs
by: Balliu, Alkida, et al.
Published: (2024)
Arma: Byzantine Fault Tolerant Consensus with Horizontal Scalability
by: Manevich, Yacov, et al.
Published: (2024)
The Case for ABI Interoperability in a Fault Tolerant MPI
by: Xu, Yao, et al.
Published: (2025)
HexGen-2: Disaggregated Generative Inference of LLMs in Heterogeneous Environment
by: Jiang, Youhe, et al.
Published: (2025)
Training LLMs with Fault Tolerant HSDP on 100,000 GPUs
by: Salpekar, Omkar, et al.
Published: (2026)
Role-Based Fault Tolerance System for LLM RL Post-Training
by: Chen, Zhenqian, et al.
Published: (2025)
Lower Bounds and Accelerated Algorithms in Distributed Stochastic Optimization with Communication Compression
by: He, Yutong, et al.
Published: (2023)
A Byzantine Fault Tolerance Approach towards AI Safety
by: deVadoss, John, et al.
Published: (2025)
Hamster: A Fast Synchronous Byzantine Fault Tolerance Protocol
by: Fu, Ximing, et al.
Published: (2024)
On Fault Tolerance of Data Storage Systems: A Holistic Perspective
by: Zheng, Mai, et al.
Published: (2025)
ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload
by: Liu, Ziyue, et al.
Published: (2026)
LIME: Accelerating Collaborative Lossless LLM Inference on Memory-Constrained Edge Devices
by: Sun, Mingyu, et al.
Published: (2025)
Autopoiesis: A Self-Evolving System Paradigm for LLM Serving Under Runtime Dynamics
by: Jiang, Youhe, et al.
Published: (2026)
Asynchronous Fault-Tolerant Language Decidability for Runtime Verification of Distributed Systems
by: Castañeda, Armando, et al.
Published: (2025)
Imitater: An Efficient Shared Mempool Protocol with Application to Byzantine Fault Tolerance
by: Zeng, Qingming, et al.
Published: (2024)
Schedule-Level Shared-Prefix Reuse for LLM RL Training
by: Li, Pengbo, et al.
Published: (2026)
TierCheck: Tiered Checkpointing for Fault Tolerance in Large Language Model Training
by: Han, Shujie, et al.
Published: (2026)
Fault-Tolerant Decentralized Distributed Asynchronous Federated Learning with Adaptive Termination Detection
by: Akkinepally, Phani Sahasra, et al.
Published: (2025)
Achieving High-Performance Fault-Tolerant Routing in HyperX Interconnection Networks
by: Camarero, Cristóbal, et al.
Published: (2024)
Process-Commutative Distributed Objects: From Cryptocurrencies to Byzantine-Fault-Tolerant CRDTs
by: Frey, Davide, et al.
Published: (2023)
SPARe: Stacked Parallelism with Adaptive Reordering for Fault-Tolerant LLM Pretraining Systems with 100k+ GPUs
by: Lee, Jin, et al.
Published: (2026)
DistFlow: A Fully Distributed RL Framework for Scalable and Efficient LLM Post-Training
by: Wang, Zhixin, et al.
Published: (2025)
An All-Reduce Compatible Top-K Compressor for Communication-Efficient Distributed Learning
by: Chen, Chuyan, et al.
Published: (2025)