Library Catalog

Bibliographic Details
Main Authors:	Huang, Haochen; Zhong, Shuzhang; Zhang, Zhe; Li, Shuangchen; Niu, Dimin; Zheng, Hongzhong; Wang, Runsheng; Li, Meng
Format:	Preprint
Published:	2025
Subjects:	Performance
Online Access:	https://arxiv.org/abs/2509.09420

Similar Items

H2EAL: Hybrid-Bonding Architecture with Hybrid Sparse Attention for Efficient Long-Context LLM Inference
by: Fu, Zizhuo, et al.
Published: (2025)

MoE-Inference-Bench: Performance Evaluation of Mixture of Expert Large Language and Vision Models
by: Chitty-Venkata, Krishna Teja, et al.
Published: (2025)

AdapMoE: Adaptive Sensitivity-based Expert Gating and Management for Efficient MoE Inference
by: Zhong, Shuzhang, et al.
Published: (2024)

MoE-Infinity: Efficient MoE Inference on Personal Machines with Sparsity-Aware Expert Cache
by: Xue, Leyang, et al.
Published: (2024)

ExpertFlow: Adaptive Expert Scheduling and Memory Coordination for Efficient MoE Inference
by: Shen, Zixu, et al.
Published: (2025)

HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference
by: Zhong, Shuzhang, et al.
Published: (2025)

MoEITS: A Green AI approach for simplifying MoE-LLMs
by: Balderas, Luis, et al.
Published: (2026)

Dynamic Expert Quantization for Scalable Mixture-of-Experts Inference
by: Chu, Kexin, et al.
Published: (2025)

LaMoSys3.5D: Enabling 3.5D-IC-Based Large Language Model Inference Serving Systems via Hardware/Software Co-Design
by: Wang, Qipan, et al.
Published: (2025)

How Much Parallelism Is "Free"? A Principle of Near-Free Parallelism for Parallel Decoding
by: He, Minghua, et al.
Published: (2026)

An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference
by: Yao, Feiyu, et al.
Published: (2026)

EDAN: Towards Understanding Memory Parallelism and Latency Sensitivity in HPC
by: Shen, Siyuan, et al.
Published: (2025)

ProPD: Dynamic Token Tree Pruning and Generation for LLM Parallel Decoding
by: Zhong, Shuzhang, et al.
Published: (2024)

DistZO2: High-Throughput and Memory-Efficient Zeroth-Order Fine-tuning LLMs with Distributed Parallel Computing
by: Wang, Liangyu, et al.
Published: (2025)

Towards Building Private LLMs: Exploring Multi-Node Expert Parallelism on Apple Silicon for Mixture-of-Experts Large Language Model
by: Chen, Mu-Chi, et al.
Published: (2025)

Cross-Platform Fused MoE Dispatch in Triton: Portable Expert Routing Without CUDA
by: Mitra, Subhadip
Published: (2026)

Mixture of Experts with Mixture of Precisions for Tuning Quality of Service
by: Imani, HamidReza, et al.
Published: (2024)

Breaking the Reward Barrier: Accelerating Tree-of-Thought Reasoning via Speculative Exploration
by: Zhong, Shuzhang, et al.
Published: (2026)

Tuning Fast Memory Size based on Modeling of Page Migration for Tiered Memory
by: Chen, Shangye, et al.
Published: (2024)

Fault-Tolerant Hybrid-Parallel Training at Scale with Reliable and Efficient In-memory Checkpointing
by: Wang, Yuxin, et al.
Published: (2023)

Machine Learning-Guided Memory Optimization for DLRM Inference on Tiered Memory
by: Ren, Jie, et al.
Published: (2025)

Cache Blocking of Distributed-Memory Parallel Matrix Power Kernels
by: Lacey, Dane C., et al.
Published: (2024)

Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts
by: Li, Yunxin, et al.
Published: (2024)

MoBE: Mixture-of-Basis-Experts for Compressing MoE-based LLMs
by: Chen, Xiaodong, et al.
Published: (2025)

CoServe: Efficient Collaboration-of-Experts (CoE) Model Inference with Limited Memory
by: Suo, Jiashun, et al.
Published: (2025)

REAM: Merging Improves Pruning of Experts in LLMs
by: Jha, Saurav, et al.
Published: (2026)

Matryoshka: Optimization of Dynamic Diverse Quantum Chemistry Systems via Elastic Parallelism Transformation
by: Wang, Tuowei, et al.
Published: (2024)

A$^3$PIM: An Automated, Analytic and Accurate Processing-in-Memory Offloader
by: Jiang, Qingcai, et al.
Published: (2024)

Accelerating Diffusion LLMs via Adaptive Parallel Decoding
by: Israel, Daniel, et al.
Published: (2025)

Rotary GPU: Exploring Local Execution Paths for Large Mixture-of-Experts Models Under Limited GPU Memory
by: Jo, Myeong Jun
Published: (2026)

Accelerating LLM Inference via Dynamic KV Cache Placement in Heterogeneous Memory System
by: Fang, Yunhua, et al.
Published: (2025)

Optimal Parallel Scheduling under Concave Speedup Functions
by: Li, Chengzhang, et al.
Published: (2025)

Characterizing Machine Learning Force Fields as Emerging Molecular Dynamics Workloads on Graphics Processing Units
by: De Alwis, Udari, et al.
Published: (2026)

PATCH: Learnable Tile-level Hybrid Sparsity for LLMs
by: Hourri, Younes, et al.
Published: (2025)

Heterogeneous Memory Pool Tuning
by: Vaverka, Filip, et al.
Published: (2025)

AI Load Dynamics--A Power Electronics Perspective
by: Li, Yuzhuo, et al.
Published: (2025)

Collaborative Processing for Multi-Tenant Inference on Memory-Constrained Edge TPUs
by: Ng, Nathan, et al.
Published: (2026)

Robust Recursive Query Parallelism in Graph Database Management Systems
by: Chakraborty, Anurag, et al.
Published: (2025)

Updates on the Low-Level Abstraction of Memory Access
by: Gruber, Bernhard Manfred
Published: (2023)

Spatiotemporal Analysis of Parallelized Computing at the Extreme Edge
by: Nabil, Yasser, et al.
Published: (2025)