| Main Authors: | Huang, Haochen, Zhong, Shuzhang, Zhang, Zhe, Li, Shuangchen, Niu, Dimin, Zheng, Hongzhong, Wang, Runsheng, Li, Meng |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2509.09420 |
Similar Items
H2EAL: Hybrid-Bonding Architecture with Hybrid Sparse Attention for Efficient Long-Context LLM Inference
by: Fu, Zizhuo, et al.
Published: (2025)
MoE-Inference-Bench: Performance Evaluation of Mixture of Expert Large Language and Vision Models
by: Chitty-Venkata, Krishna Teja, et al.
Published: (2025)
AdapMoE: Adaptive Sensitivity-based Expert Gating and Management for Efficient MoE Inference
by: Zhong, Shuzhang, et al.
Published: (2024)
MoE-Infinity: Efficient MoE Inference on Personal Machines with Sparsity-Aware Expert Cache
by: Xue, Leyang, et al.
Published: (2024)
ExpertFlow: Adaptive Expert Scheduling and Memory Coordination for Efficient MoE Inference
by: Shen, Zixu, et al.
Published: (2025)
HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference
by: Zhong, Shuzhang, et al.
Published: (2025)
MoEITS: A Green AI approach for simplifying MoE-LLMs
by: Balderas, Luis, et al.
Published: (2026)
Dynamic Expert Quantization for Scalable Mixture-of-Experts Inference
by: Chu, Kexin, et al.
Published: (2025)
LaMoSys3.5D: Enabling 3.5D-IC-Based Large Language Model Inference Serving Systems via Hardware/Software Co-Design
by: Wang, Qipan, et al.
Published: (2025)
How Much Parallelism Is "Free"? A Principle of Near-Free Parallelism for Parallel Decoding
by: He, Minghua, et al.
Published: (2026)
An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference
by: Yao, Feiyu, et al.
Published: (2026)
EDAN: Towards Understanding Memory Parallelism and Latency Sensitivity in HPC
by: Shen, Siyuan, et al.
Published: (2025)
ProPD: Dynamic Token Tree Pruning and Generation for LLM Parallel Decoding
by: Zhong, Shuzhang, et al.
Published: (2024)
DistZO2: High-Throughput and Memory-Efficient Zeroth-Order Fine-tuning LLMs with Distributed Parallel Computing
by: Wang, Liangyu, et al.
Published: (2025)
Towards Building Private LLMs: Exploring Multi-Node Expert Parallelism on Apple Silicon for Mixture-of-Experts Large Language Model
by: Chen, Mu-Chi, et al.
Published: (2025)
Cross-Platform Fused MoE Dispatch in Triton: Portable Expert Routing Without CUDA
by: Mitra, Subhadip
Published: (2026)
Mixture of Experts with Mixture of Precisions for Tuning Quality of Service
by: Imani, HamidReza, et al.
Published: (2024)
Breaking the Reward Barrier: Accelerating Tree-of-Thought Reasoning via Speculative Exploration
by: Zhong, Shuzhang, et al.
Published: (2026)
Tuning Fast Memory Size based on Modeling of Page Migration for Tiered Memory
by: Chen, Shangye, et al.
Published: (2024)
Fault-Tolerant Hybrid-Parallel Training at Scale with Reliable and Efficient In-memory Checkpointing
by: Wang, Yuxin, et al.
Published: (2023)
Machine Learning-Guided Memory Optimization for DLRM Inference on Tiered Memory
by: Ren, Jie, et al.
Published: (2025)
Cache Blocking of Distributed-Memory Parallel Matrix Power Kernels
by: Lacey, Dane C., et al.
Published: (2024)
Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts
by: Li, Yunxin, et al.
Published: (2024)
MoBE: Mixture-of-Basis-Experts for Compressing MoE-based LLMs
by: Chen, Xiaodong, et al.
Published: (2025)
CoServe: Efficient Collaboration-of-Experts (CoE) Model Inference with Limited Memory
by: Suo, Jiashun, et al.
Published: (2025)
REAM: Merging Improves Pruning of Experts in LLMs
by: Jha, Saurav, et al.
Published: (2026)
Matryoshka: Optimization of Dynamic Diverse Quantum Chemistry Systems via Elastic Parallelism Transformation
by: Wang, Tuowei, et al.
Published: (2024)
A$^3$PIM: An Automated, Analytic and Accurate Processing-in-Memory Offloader
by: Jiang, Qingcai, et al.
Published: (2024)
Accelerating Diffusion LLMs via Adaptive Parallel Decoding
by: Israel, Daniel, et al.
Published: (2025)
Rotary GPU: Exploring Local Execution Paths for Large Mixture-of-Experts Models Under Limited GPU Memory
by: Jo, Myeong Jun
Published: (2026)
Accelerating LLM Inference via Dynamic KV Cache Placement in Heterogeneous Memory System
by: Fang, Yunhua, et al.
Published: (2025)
Optimal Parallel Scheduling under Concave Speedup Functions
by: Li, Chengzhang, et al.
Published: (2025)
Characterizing Machine Learning Force Fields as Emerging Molecular Dynamics Workloads on Graphics Processing Units
by: De Alwis, Udari, et al.
Published: (2026)
PATCH: Learnable Tile-level Hybrid Sparsity for LLMs
by: Hourri, Younes, et al.
Published: (2025)
Heterogeneous Memory Pool Tuning
by: Vaverka, Filip, et al.
Published: (2025)
AI Load Dynamics--A Power Electronics Perspective
by: Li, Yuzhuo, et al.
Published: (2025)
Collaborative Processing for Multi-Tenant Inference on Memory-Constrained Edge TPUs
by: Ng, Nathan, et al.
Published: (2026)
Robust Recursive Query Parallelism in Graph Database Management Systems
by: Chakraborty, Anurag, et al.
Published: (2025)
Updates on the Low-Level Abstraction of Memory Access
by: Gruber, Bernhard Manfred
Published: (2023)
Spatiotemporal Analysis of Parallelized Computing at the Extreme Edge
by: Nabil, Yasser, et al.
Published: (2025)