| Main Author: | Fu, Yao |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2405.08944 |
Similar Items
Efficient Deployment of Large Language Models on Resource-constrained Devices
by: Yao, Zhiwei, et al.
Published: (2025)
Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer
by: Yao, Jinghan, et al.
Published: (2024)
LongCat-Flash Technical Report
by: Meituan LongCat Team, et al.
Published: (2025)
OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models
by: Xue, Fuzhao, et al.
Published: (2024)
Deploying Atmospheric and Oceanic AI Models on Chinese Hardware and Framework: Migration Strategies, Performance Optimization and Analysis
by: Sun, Yuze, et al.
Published: (2025)
MACE: A Hybrid LLM Serving System with Colocated SLO-aware Continuous Retraining Alignment
by: Li, Yufei, et al.
Published: (2025)
LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
by: Yang, Shang, et al.
Published: (2025)
Etalon: Holistic Performance Evaluation Framework for LLM Inference Systems
by: Agrawal, Amey, et al.
Published: (2024)
Fed-SB: A Silver Bullet for Extreme Communication Efficiency and Performance in (Private) Federated LoRA Fine-Tuning
by: Singhal, Raghav, et al.
Published: (2025)
Improving Parallel Program Performance with LLM Optimizers via Agent-System Interfaces
by: Wei, Anjiang, et al.
Published: (2024)
ReaL: Efficient RLHF Training of Large Language Models with Parameter Reallocation
by: Mei, Zhiyu, et al.
Published: (2024)
FinGPT-HPC: Efficient Pretraining and Finetuning Large Language Models for Financial Applications with High-Performance Computing
by: Liu, Xiao-Yang, et al.
Published: (2024)
Optimizing the Deployment of Tiny Transformers on Low-Power MCUs
by: Jung, Victor J. B., et al.
Published: (2024)
From Centralized to Decentralized Federated Learning: Theoretical Insights, Privacy Preservation, and Robustness Challenges
by: Li, Qiongxiu, et al.
Published: (2025)
Kascade: A Practical Sparse Attention Method for Long-Context LLM Inference
by: Deshmukh, Dhruv, et al.
Published: (2025)
Towards Efficient Multi-LLM Inference: Characterization and Analysis of LLM Routing and Hierarchical Techniques
by: Behera, Adarsh Prasad, et al.
Published: (2025)
Context-Aware Inference via Performance Forecasting in Decentralized Learning Networks
by: Pfeffer, Joel, et al.
Published: (2025)
Pre-Deployment Complexity Estimation for Federated Perception Systems
by: Solaiman, KMA, et al.
Published: (2026)
RL in the Wild: Characterizing RLVR Training in LLM Deployment
by: Zhou, Jiecheng, et al.
Published: (2025)
TawPipe: Topology-Aware Weight Pipeline Parallelism for Accelerating Long-Context Large Models Training
by: Wu, Houming, et al.
Published: (2025)
MineDraft: A Framework for Batch Parallel Speculative Decoding
by: Tang, Zhenwei, et al.
Published: (2026)
Taming the Titans: A Survey of Efficient LLM Inference Serving
by: Zhen, Ranran, et al.
Published: (2025)
PithTrain: A Compact and Agent-Native MoE Training System
by: Lai, Ruihang, et al.
Published: (2026)
A Sparsity Predicting Approach for Large Language Models via Activation Pattern Clustering
by: Dhar, Nobel, et al.
Published: (2025)
A Structure-Agnostic Co-Tuning Framework for LLMs and SLMs in Cloud-Edge Systems
by: Liu, Yuze, et al.
Published: (2025)
Advancing MoE Efficiency: A Collaboration-Constrained Routing (C2R) Strategy for Better Expert Parallelism Design
by: Zhang, Mohan, et al.
Published: (2025)
LongCat-Flash-Omni Technical Report
by: Meituan LongCat Team, et al.
Published: (2025)
Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead
by: Brüel-Gabrielsson, Rickard, et al.
Published: (2024)
BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching
by: Zheng, Zhen, et al.
Published: (2024)
SuffixDecoding: Extreme Speculative Decoding for Emerging AI Applications
by: Oliaro, Gabriele, et al.
Published: (2024)
DLoRA: Distributed Parameter-Efficient Fine-Tuning Solution for Large Language Model
by: Gao, Chao, et al.
Published: (2024)
Liger Kernel: Efficient Triton Kernels for LLM Training
by: Hsu, Pin-Lun, et al.
Published: (2024)
Toward Sustainable GenAI using Generation Directives for Carbon-Friendly Large Language Model Inference
by: Li, Baolin, et al.
Published: (2024)
Fisher Information-based Efficient Curriculum Federated Learning with Large Language Models
by: Liu, Ji, et al.
Published: (2024)
TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training
by: Liang, Wanchao, et al.
Published: (2024)
BlackMamba: Mixture of Experts for State-Space Models
by: Anthony, Quentin, et al.
Published: (2024)
End-Cloud Collaboration Framework for Advanced AI Customer Service in E-commerce
by: Teng, Liangyu, et al.
Published: (2024)
Distributed Speculative Inference (DSI): Speculation Parallelism for Provably Faster Lossless Language Model Inference
by: Timor, Nadav, et al.
Published: (2024)
Learned Best-Effort LLM Serving
by: Jha, Siddharth, et al.
Published: (2024)
Systems and Algorithms for Convolutional Multi-Hybrid Language Models at Scale
by: Ku, Jerome, et al.
Published: (2025)