| Main Authors: | Bekman, Stas, Rajbhandari, Samyam, Wyatt, Michael, Rasley, Jeff, Ruwase, Tunji, Yao, Zhewei, Qiao, Aurick, He, Yuxiong |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2506.13996 |
Similar Items
Shift Parallelism: Low-Latency, High-Throughput LLM Inference for Dynamic Workloads
by: Hidayetoglu, Mert, et al.
Published: (2025)
SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation
by: Qiao, Aurick, et al.
Published: (2024)
Arctic Inference with Shift Parallelism: Fast and Efficient Open Source Inference System for Enterprise AI
by: Rajbhandari, Samyam, et al.
Published: (2025)
MoE-Prefill: Zero Redundancy Overheads in MoE Prefill Serving
by: Su, Zhaoyuan, et al.
Published: (2026)
STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning
by: Lee, Jaeseong, et al.
Published: (2024)
DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference
by: Holmes, Connor, et al.
Published: (2024)
OWL: Overcoming Window Length-Dependence in Speculative Decoding for Long-Context Inputs
by: Lee, Jaeseong, et al.
Published: (2025)
Universal Checkpointing: A Flexible and Efficient Distributed Checkpointing System for Large-Scale DNN Training with Reconfigurable Parallelis
by: Lian, Xinyu, et al.
Published: (2024)
AutoSP: Unlocking Long-Context LLM Training Via Compiler-Based Sequence Parallelism
by: Gupta, Ahan, et al.
Published: (2026)
Federated Timeline Synthesis: Scalable and Private Methodology For Model Training and Deployment
by: Renc, Pawel, et al.
Published: (2025)
Fast and Accurate Causal Parallel Decoding using Jacobi Forcing
by: Hu, Lanxiang, et al.
Published: (2025)
FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design
by: Xia, Haojun, et al.
Published: (2024)
FastPersist: Accelerating Model Checkpointing in Deep Learning
by: Wang, Guanhua, et al.
Published: (2024)
DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing
by: Li, Conglong, et al.
Published: (2022)
Training a Large Language Model for Medical Coding Using Privacy-Preserving Synthetic Clinical Data
by: Cook, John, et al.
Published: (2026)
ExCoT: Optimizing Reasoning for Text-to-SQL with Execution Feedback
by: Zhai, Bohan, et al.
Published: (2025)
MedOrch: Medical Diagnosis with Tool-Augmented Reasoning Agents for Flexible Extensibility
by: He, Yexiao, et al.
Published: (2025)
LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism
by: Gu, Diandian, et al.
Published: (2024)
Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training
by: Luo, Cheng, et al.
Published: (2024)
Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer
by: Yao, Jinghan, et al.
Published: (2024)
BurstEngine: an Efficient Distributed Framework for Training Transformers on Extremely Long Sequences of over 1M Tokens
by: Sun, Ao, et al.
Published: (2025)
Learning to Hint for Reinforcement Learning
by: Xia, Yu, et al.
Published: (2026)
R$^3$-SQL: Ranking Reward and Resampling for Text-to-SQL
by: Han, Hojae, et al.
Published: (2026)
Communication-Efficient Sparsely-Activated Model Training via Sequence Migration and Token Condensation
by: Chen, Fahao, et al.
Published: (2024)
Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding
by: Zhang, Zhenyu, et al.
Published: (2024)
Learning to Self-Evolve
by: Chen, Xiaoyin, et al.
Published: (2026)
StreamBP: Memory-Efficient Exact Backpropagation for Long Sequence Training of LLMs
by: Luo, Qijun, et al.
Published: (2025)
SuperOffload: Unleashing the Power of Large-Scale LLM Training on Superchips
by: Lian, Xinyu, et al.
Published: (2025)
Inference Scaling for Bridging Retrieval and Augmented Generation
by: Lee, Youngwon, et al.
Published: (2024)
CORD: Balancing COnsistency and Rank Distillation for Robust Retrieval-Augmented Generation
by: Lee, Youngwon, et al.
Published: (2024)
Context Parallelism for Scalable Million-Token Inference
by: Yang, Amy, et al.
Published: (2024)
Pretext Training Algorithms for Event Sequence Data
by: Wang, Yimu, et al.
Published: (2024)
Training-free LLM-generated Text Detection by Mining Token Probability Sequences
by: Xu, Yihuai, et al.
Published: (2024)
Arctic-Text2SQL-R1: Simple Rewards, Strong Reasoning in Text-to-SQL
by: Yao, Zhewei, et al.
Published: (2025)
SPPO: Efficient Long-sequence LLM Training via Adaptive Sequence Pipeline Parallel Offloading
by: Chen, Qiaoling, et al.
Published: (2025)
HelixPipe: Efficient Distributed Training of Long Sequence Transformers with Attention Parallel Pipeline Parallelism
by: Zhang, Geng, et al.
Published: (2025)
Multi-word Tokenization for Sequence Compression
by: Gee, Leonidas, et al.
Published: (2024)
ComposeRAG: A Modular and Composable RAG for Corpus-Grounded Multi-Hop Question Answering
by: Wu, Ruofan, et al.
Published: (2025)
DARE-bench: Evaluating Modeling and Instruction Fidelity of LLMs in Data Science
by: Shu, Fan, et al.
Published: (2026)
Out of the Memory Barrier: A Highly Memory Efficient Training System for LLMs with Million-Token Contexts
by: Li, Wenhao, et al.
Published: (2026)