| Main Authors: | Sun, Minqiu; Huang, Xin; Guo, Luanzheng; Tallent, Nathan R.; Sato, Kento; Dai, Dong |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2602.22158 |
Similar Items
Scrutinizing Variables for Checkpoint Using Automatic Differentiation
Author: Huang, Xin, et al.
Published: (2026)
On The Reproducibility Limitations of RAG Systems
Author: Wang, Baiqiang, et al.
Published: (2025)
PowerTrip: Exploiting Federated Heterogeneous Datacenter Power for Distributed ML Training
Author: Mehboob, Talha, et al.
Published: (2025)
QoSFlow: Ensuring Service Quality of Distributed Workflows Using Interpretable Sensitivity Models
Author: Rashid, Md Hasanur, et al.
Published: (2026)
Distributed Order Recording Techniques for Efficient Record-and-Replay of Multi-threaded Programs
Author: Fu, Xiang, et al.
Published: (2026)
CARAT: Client-Side Adaptive RPC and Cache Co-Tuning for Parallel File Systems
Author: Rashid, Md Hasanur, et al.
Published: (2026)
ParaLog: Consistent Host-side Logging for Parallel Checkpoints
Author: Chien, Steven W. D., et al.
Published: (2024)
MassiveGNN: Efficient Training via Prefetching for Massively Connected Distributed Graphs
Author: Sarkar, Aishwarya, et al.
Published: (2024)
Memory-Efficient Federated Fine-Tuning of Large Language Models via Layer Pruning
Author: Wu, Yebo, et al.
Published: (2025)
Overcoming Memory Constraints in Quantum Circuit Simulation with a High-Fidelity Compression Framework
Author: Zhang, Boyuan, et al.
Published: (2024)
NOMAD: Generating Embeddings for Massive Distributed Graphs
Author: Sarkar, Aishwarya, et al.
Published: (2026)
Understanding Power Consumption Metric on Heterogeneous Memory Systems
Author: Proaño, Andrès Rubio, et al.
Published: (2024)
Improving SpGEMM Performance Through Matrix Reordering and Cluster-wise Computation
Author: Islam, Abdullah Al Raqibul, et al.
Published: (2025)
TierCheck: Tiered Checkpointing for Fault Tolerance in Large Language Model Training
Author: Han, Shujie, et al.
Published: (2026)
Efficient LLM Inference with Activation Checkpointing and Hybrid Caching
Author: Lee, Sanghyeon, et al.
Published: (2025)
FedQuad: Adaptive Layer-wise LoRA Deployment and Activation Quantization for Federated Fine-Tuning
Author: Li, Rukuo, et al.
Published: (2025)
DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models
Author: Maurya, Avinash, et al.
Published: (2024)
InternEvo: Efficient Long-sequence Large Language Model Training via Hybrid Parallelism and Redundant Sharding
Author: Chen, Qiaoling, et al.
Published: (2024)
Universal Checkpointing: A Flexible and Efficient Distributed Checkpointing System for Large-Scale DNN Training with Reconfigurable Parallelism
Author: Lian, Xinyu, et al.
Published: (2024)
Efficient Training of Large Language Models on Distributed Infrastructures: A Survey
Author: Duan, Jiangfei, et al.
Published: (2024)
SLO-Aware Scheduling for Large Language Model Inferences
Author: Huang, Jinqi, et al.
Published: (2025)
DreamDDP: Accelerating Data Parallel Distributed LLM Training with Layer-wise Scheduled Partial Synchronization
Author: Tang, Zhenheng, et al.
Published: (2025)
Fault-Tolerant Hybrid-Parallel Training at Scale with Reliable and Efficient In-memory Checkpointing
Author: Wang, Yuxin, et al.
Published: (2023)
Asynchronous Checkpoint for Eventually Consistent Databases
Author: Ravishankar, Raaghav, et al.
Published: (2025)
Checkmate: Zero-Overhead Model Checkpointing via Network Gradient Replication
Author: Bhardwaj, Ankit, et al.
Published: (2025)
Mell: Memory-Efficient Large Language Model Serving via Multi-GPU KV Cache Management
Author: Qianli, Liu, et al.
Published: (2025)
Seq1F1B: Efficient Sequence-Level Pipeline Parallelism for Large Language Model Training
Author: Sun, Ao, et al.
Published: (2024)
CRIUgpu: Transparent Checkpointing of GPU-Accelerated Workloads
Author: Stoyanov, Radostin, et al.
Published: (2025)
Optimal Checkpoint Interval with Availability as an Objective Function
Author: Saxena, Nirmal Raj, et al.
Published: (2024)
Checkpoint and Restart: An Energy Consumption Characterization in Clusters
Author: Moran, Marina, et al.
Published: (2024)
SeaLLM: Service-Aware and Latency-Optimized Resource Sharing for Large Language Model Inference
Author: Zhao, Yihao, et al.
Published: (2025)
Cascadia: An Efficient Cascade Serving System for Large Language Models
Author: Jiang, Youhe, et al.
Published: (2025)
Sparse Checkpointing for Fast and Reliable MoE Training
Author: Gandhi, Swapnil, et al.
Published: (2024)
DistTrain: Addressing Model and Data Heterogeneity with Disaggregated Training for Multimodal Large Language Models
Author: Zhang, Zili, et al.
Published: (2024)
FedAPTA: Federated Multi-task Learning for Heterogeneous Devices with Adaptive Layer-wise Pruning and Task-aware Aggregation
Author: Yu, Zhen, et al.
Published: (2025)
Pier: Efficient Large Language Model pretraining with Relaxed Global Communication
Author: Fan, Shuyuan, et al.
Published: (2025)
CRIU -- Checkpoint Restore in Userspace for computational simulations and scientific applications
Author: Andrijauskas, Fabio, et al.
Published: (2024)
Understanding LLM Checkpoint/Restore I/O Strategies and Patterns
Author: Gossman, Mikaila J., et al.
Published: (2025)
FLYING SERVING: On-the-Fly Parallelism Switching for Large Language Model Serving
Author: Gao, Shouwei, et al.
Published: (2026)
MoLink: Distributed and Efficient Serving Framework for Large Models
Author: Jin, Lewei, et al.
Published: (2025)