:: Library Catalog

Bibliographic Details
Main Authors:	Wu, Zihan; Huang, Zhaoke; Yan, Hong
Format:	Preprint
Published:	2024
Subjects:	Distributed, Parallel, and Cluster Computing; Machine Learning; H.2.8
Online Access:	https://arxiv.org/abs/2410.18113

Similar Items

Challenges of Heterogeneity in Big Data: A Comparative Study of Classification in Large-Scale Structured and Unstructured Domains
by: Eduardo, González Trigueros Jesús, et al.
Published: (2025)

Heuristic Search Space Partitioning for Low-Latency Multi-Tenant Cloud Queries
by: Pathak, Prashant Kumar, et al.
Published: (2026)

Spark-LLM-Eval: A Distributed Framework for Statistically Rigorous Large Language Model Evaluation
by: Mitra, Subhadip
Published: (2026)

Scalability Optimization in Cloud-Based AI Inference Services: Strategies for Real-Time Load Balancing and Automated Scaling
by: Jin, Yihong, et al.
Published: (2025)

Cost-Aware Logging: Measuring the Financial Impact of Excessive Log Retention in Small-Scale Cloud Deployments
by: Putra, Jody Almaida
Published: (2026)

A Semantic Partitioning Method for Large-Scale Training of Knowledge Graph Embeddings
by: Bai, Yuhe
Published: (2025)

Revisiting Reliability in Large-Scale Machine Learning Research Clusters
by: Kokolis, Apostolos, et al.
Published: (2024)

Learning Interpretable Scheduling Algorithms for Data Processing Clusters
by: Hu, Zhibo, et al.
Published: (2024)

Token Arena: A Continuous Benchmark Unifying Energy and Cognition in AI Inference
by: Gao, Yuxuan, et al.
Published: (2026)

AsyncHZP: Hierarchical ZeRO Parallelism with Asynchronous Scheduling for Scalable LLM Training
by: Bai, Huawei, et al.
Published: (2025)

SiDA-MoE: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models
by: Du, Zhixu, et al.
Published: (2023)

Deploy, Calibrate, Monitor, Heal -- No Human Required: An Autonomous AI SRE Agent for Elasticsearch
by: Mukkolakkal, Muhamed Ramees Cheriya
Published: (2026)

Scalable High-Dimensional Multivariate Linear Regression for Feature-Distributed Data
by: Huang, Shuo-Chieh, et al.
Published: (2023)

Arena: Efficiently Training Large Models via Dynamic Scheduling and Adaptive Parallelism Co-Design
by: Xue, Chunyu, et al.
Published: (2024)

ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale
by: Won, William, et al.
Published: (2023)

Learning the Optimal Path and DNN Partition for Collaborative Edge Inference
by: Huang, Yin, et al.
Published: (2024)

Decoupled Vertical Federated Learning for Practical Training on Vertically Partitioned Data
by: Amalanshu, Avi, et al.
Published: (2024)

Efficient Construction of Large Search Spaces for Auto-Tuning
by: Willemsen, Floris-Jan, et al.
Published: (2025)

EmbedPart: Embedding-Driven Graph Partitioning for Scalable Graph Neural Network Training
by: Merkel, Nikolai, et al.
Published: (2026)

EncCluster: Scalable Functional Encryption in Federated Learning through Weight Clustering and Probabilistic Filters
by: Tsouvalas, Vasileios, et al.
Published: (2024)

Rethinking Personalized Federated Learning with Clustering-based Dynamic Graph Propagation
by: Wang, Jiaqi, et al.
Published: (2024)

FedClust: Optimizing Federated Learning on Non-IID Data through Weight-Driven Client Clustering
by: Islam, Md Sirajul, et al.
Published: (2024)

Cross-Silo Federated Learning for Multi-Tier Networks with Vertical and Horizontal Data Partitioning
by: Das, Anirban, et al.
Published: (2021)

Communication-Efficient Hybrid Federated Learning for E-health with Horizontal and Vertical Data Partitioning
by: Yu, Chong, et al.
Published: (2024)

SURGE: SuperBatch Unified Resource-efficient GPU Encoding for Heterogeneous Partitioned Data
by: Kapadia, Shashank, et al.
Published: (2026)

Large-Scale Graph Building in Dynamic Environments: Low Latency and High Quality
by: de Almeida, Filipe Miguel Gonçalves, et al.
Published: (2025)

Guard: Scalable Straggler Detection and Node Health Management for Large-Scale Training
by: Liu, Guanliang, et al.
Published: (2026)

FLEdge: Benchmarking Federated Machine Learning Applications in Edge Computing Systems
by: Woisetschläger, Herbert, et al.
Published: (2023)

LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization
by: Zhao, Juntao, et al.
Published: (2024)

Flexible Clustered Federated Learning for Client-Level Data Distribution Shift
by: Duan, Moming, et al.
Published: (2021)

N2N: A Parallel Framework for Large-Scale MILP under Distributed Memory
by: Wang, Longfei, et al.
Published: (2025)

Operational Memory Architecture for Kubernetes: Preserving Causal Context Across the Evidence Horizon
by: Khan, Shamsher
Published: (2026)

A Semi-Supervised Federated Learning Framework with Hierarchical Clustering Aggregation for Heterogeneous Satellite Networks
by: Liu, Zhuocheng, et al.
Published: (2025)

CE-CoLLM: Efficient and Adaptive Large Language Models Through Cloud-Edge Collaboration
by: Jin, Hongpeng, et al.
Published: (2024)

Reinforcement Learning Optimization for Large-Scale Learning: An Efficient and User-Friendly Scaling Library
by: Wang, Weixun, et al.
Published: (2025)

AMDP: Asynchronous Multi-Directional Pipeline Parallelism for Large-Scale Models Training
by: Chen, Ling, et al.
Published: (2026)

MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
by: Jiang, Ziheng, et al.
Published: (2024)

Hierarchical Resource Partitioning on Modern GPUs: A Reinforcement Learning Approach
by: Saroliya, Urvij, et al.
Published: (2024)

DualSparse-MoE: Coordinating Tensor/Neuron-Level Sparsity with Expert Partition and Reconstruction
by: Cai, Weilin, et al.
Published: (2025)

Hydraulis: Balancing Large Transformer Model Training via Co-designing Parallel Strategies and Data Assignment
by: Li, Haoyang, et al.
Published: (2024)