| Main authors: | Interrante-Grant, Alexander; Varela-Rosa, Carla; Narayan, Suhaas; Connelly, Chris; Reuther, Albert |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online access: | https://arxiv.org/abs/2509.05258 |
Similar items
Scaling Large Language Model Training on Frontier with Low-Bandwidth Partitioning
by: Xu, Lang, et al.
Published: (2025)
Characterizing Performance-Energy Trade-offs of Large Language Models in Multi-Request Workflows
by: Ifath, Md. Monzurul Amin, et al.
Published: (2026)
Scalable Pretraining of Large Mixture of Experts Language Models on Aurora Super Computer
by: Vooturi, Dharma Teja, et al.
Published: (2026)
MegaScale-Data: Scaling Dataloader for Multisource Large Foundation Model Training
by: Zhao, Juntao, et al.
Published: (2025)
Distributed LLM Pretraining During Renewable Curtailment Windows: A Feasibility Study
by: Wiesner, Philipp, et al.
Published: (2026)
Can Large Language Models Predict Parallel Code Performance?
by: Bolet, Gregory, et al.
Published: (2025)
Lumos: Efficient Performance Modeling and Estimation for Large-scale LLM Training
by: Liang, Mingyu, et al.
Published: (2025)
HETHUB: A Distributed Training System with Heterogeneous Cluster for Large-Scale Models
by: Xu, Si, et al.
Published: (2024)
Hierarchical Autoscaling for Large Language Model Serving with Chiron
by: Patke, Archit, et al.
Published: (2025)
Can Large Language Models Write Parallel Code?
by: Nichols, Daniel, et al.
Published: (2024)
Efficient Multi-Model Orchestration for Self-Hosted Large Language Models
by: Vangala, Bhanu Prakash, et al.
Published: (2025)
HPC-Coder: Modeling Parallel Programs using Large Language Models
by: Nichols, Daniel, et al.
Published: (2023)
FinGPT-HPC: Efficient Pretraining and Finetuning Large Language Models for Financial Applications with High-Performance Computing
by: Liu, Xiao-Yang, et al.
Published: (2024)
Large Language Model Partitioning for Low-Latency Inference at the Edge
by: Kafetzis, Dimitrios, et al.
Published: (2025)
Equinox: Holistic Fair Scheduling in Serving Large Language Models
by: Wei, Zhixiang, et al.
Published: (2025)
Nightjar: Dynamic Adaptive Speculative Decoding for Large Language Models Serving
by: Li, Rui, et al.
Published: (2025)
Understand and Accelerate Memory Processing Pipeline for Large Language Model Inference
by: He, Zifan, et al.
Published: (2026)
Accelerating Large Language Model Training with Hybrid GPU-based Compression
by: Xu, Lang, et al.
Published: (2024)
ScaleSim: Serving Large-Scale Multi-Agent Simulation with Invocation Distance-Based Memory Management
by: Pan, Zaifeng, et al.
Published: (2026)
CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training
by: Gu, Yida, et al.
Published: (2026)
SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting
by: Xu, Jiaming, et al.
Published: (2025)
AIBrix: Towards Scalable, Cost-Effective Large Language Model Inference Infrastructure
by: The AIBrix Team, et al.
Published: (2025)
Adaptive Fault Tolerance Mechanisms of Large Language Models in Cloud Computing Environments
by: Jin, Yihong, et al.
Published: (2025)
TierCheck: Tiered Checkpointing for Fault Tolerance in Large Language Model Training
by: Han, Shujie, et al.
Published: (2026)
TCM-Serve: Modality-aware Scheduling for Multimodal Large Language Model Inference
by: Papaioannou, Konstantinos, et al.
Published: (2026)
A Survey on Large Language Model Acceleration based on KV Cache Management
by: Li, Haoyang, et al.
Published: (2024)
Research on Model Parallelism and Data Parallelism Optimization Methods in Large Language Model-Based Recommendation Systems
by: Yang, Haowei, et al.
Published: (2025)
Federated Fine-Tuning of Sparsely-Activated Large Language Models on Resource-Constrained Devices
by: Chen, Fahao, et al.
Published: (2025)
Connecting Large Language Model Agent to High Performance Computing Resource
by: Ma, Heng, et al.
Published: (2025)
Leveraging Large Language Model for Intelligent Log Processing and Autonomous Debugging in Cloud AI Platforms
by: Ji, Cheng, et al.
Published: (2025)
Mist: Efficient Distributed Training of Large Language Models via Memory-Parallelism Co-Optimization
by: Zhu, Zhanda, et al.
Published: (2025)
EDiT: A Local-SGD-Based Efficient Distributed Training Method for Large Language Models
by: Cheng, Jialiang, et al.
Published: (2024)
Not All Errors Are Equal: A Systematic Study of Error Propagation in Large Language Model Inference
by: Huang, Yafan, et al.
Published: (2026)
Connecting Large Language Models with Blockchain: Advancing the Evolution of Smart Contracts from Automation to Intelligence
by: Xian, Youquan, et al.
Published: (2024)
Training Overhead Ratio: A Practical Reliability Metric for Large Language Model Training Systems
by: Lu, Ning, et al.
Published: (2024)
Ensemble Method for System Failure Detection Using Large-Scale Telemetry Data
by: Mudgal, Priyanka, et al.
Published: (2024)
Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey
by: Liang, Feng, et al.
Published: (2024)
PlanetServe: A Decentralized, Scalable, and Privacy-Preserving Overlay for Democratizing Large Language Model Serving
by: Fang, Fei, et al.
Published: (2025)
Resource Allocation and Workload Scheduling for Large-Scale Distributed Deep Learning: A Survey
by: Liang, Feng, et al.
Published: (2024)
Cloud-Based AI Systems: Leveraging Large Language Models for Intelligent Fault Detection and Autonomous Self-Healing
by: Ji, Cheng, et al.
Published: (2025)