| Main Authors: | Ma, Haiyue, Du, Zhixu, Chen, Yiran |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2506.07366 |
Similar Items
FlashMoE: Fast Distributed MoE in a Single Kernel
by: Aimuyo, Osayamen Jonathan, et al.
Published: (2025)
Efficient MoE Serving in the Memory-Bound Regime: Balance Activated Experts, Not Tokens
by: Yu, Yanpeng, et al.
Published: (2025)
Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference
by: Hwang, Ranggi, et al.
Published: (2023)
SliceMoE: Bit-Sliced Expert Caching under Miss-Rate Constraints for Efficient MoE Inference
by: Choi, Yuseon, et al.
Published: (2025)
Stratum: System-Hardware Co-Design with Tiered Monolithic 3D-Stackable DRAM for Efficient MoE Serving
by: Pan, Yue, et al.
Published: (2025)
Expert Streaming: Accelerating Low-Batch MoE Inference via Multi-chiplet Architecture and Dynamic Expert Trajectory Scheduling
by: Ma, Songchen, et al.
Published: (2026)
Accelerating MoE with Dynamic In-Switch Computing on Multi-GPUs
by: Zhang, Qijun, et al.
Published: (2026)
MoE-Hub: Taming Software Complexity for Seamless MoE Overlap with Hardware-Accelerated Communication on Multi-GPU Systems
by: Zhou, Zhuoshan, et al.
Published: (2026)
AxMoE: Characterizing the Impact of Approximate Multipliers on Mixture-of-Experts DNN Architectures
by: Shende, Omkar B, et al.
Published: (2026)
A3D-MoE: Acceleration of Large Language Models with Mixture of Experts via 3D Heterogeneous Integration
by: Huang, Wei-Hsing, et al.
Published: (2025)
Patterns behind Chaos: Forecasting Data Movement for Efficient Large-Scale MoE LLM Inference
by: Yu, Zhongkai, et al.
Published: (2025)
Mozart: Modularized and Efficient MoE Training on 3.5D Wafer-Scale Chiplet Architectures
by: Luo, Shuqing, et al.
Published: (2026)
ELMoE-3D: Leveraging Intrinsic Elasticity of MoE for Hybrid-Bonding-Enabled Self-Speculative Decoding in On-Premises Serving
by: Choi, Yuseon, et al.
Published: (2026)
RouteScan: A Non-Intrusive Approach to Auditing MoE LLMs Safety via Expert Routing Telemetry
by: Lv, Bo, et al.
Published: (2026)
MoNDE: Mixture of Near-Data Experts for Large-Scale Sparse Models
by: Kim, Taehyun, et al.
Published: (2024)
TriMoE: Augmenting GPU with AMX-Enabled CPU and DIMM-NDP for High-Throughput MoE Inference via Offloading
by: Pan, Yudong, et al.
Published: (2026)
BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration
by: Chen, Yuzong, et al.
Published: (2024)
On the Shape of Latent Variables in a Denoising VAE-MoG: A Posterior Sampling-Based Study
by: Bascuñán, Fernanda Zapata
Published: (2025)
Accelerating Frontier MoE Training with 3D Integrated Optics
by: Bernadskiy, Mikhail, et al.
Published: (2025)
End-to-End Transformer Acceleration Through Processing-in-Memory Architectures
by: Yang, Xiaoxuan, et al.
Published: (2025)
UbiMoE: A Ubiquitous Mixture-of-Experts Vision Transformer Accelerator With Hybrid Computation Pattern on FPGA
by: Dong, Jiale, et al.
Published: (2025)
EVA: Accelerating LLM Decoding via an Efficient Vector Quantization Architecture
by: Duan, Bowen, et al.
Published: (2026)
Context-Aware Mixture-of-Experts Inference on CXL-Enabled GPU-NDP Systems
by: Fan, Zehao, et al.
Published: (2025)
Graph Neural Networks Based Analog Circuit Link Prediction
by: Pan, Guanyuan, et al.
Published: (2025)
Unsupervised Graph Neural Network Framework for Balanced Multipatterning in Advanced Electronic Design Automation Layouts
by: Helaly, Abdelrahman, et al.
Published: (2025)
Duplex: A Device for Large Language Models with Mixture of Experts, Grouped Query Attention, and Continuous Batching
by: Yun, Sungmin, et al.
Published: (2024)
Hardware-Aware Data and Instruction Mapping for AI Tasks: Balancing Parallelism, I/O and Memory Tradeoffs
by: Chowdhury, Md Rownak Hossain, et al.
Published: (2025)
MonoSparse-CAM: Efficient Tree Model Processing via Monotonicity and Sparsity in CAMs
by: Molom-Ochir, Tergel, et al.
Published: (2024)
CAMformer: Associative Memory is All You Need
by: Molom-Ochir, Tergel, et al.
Published: (2025)
Hardware-Aware Neural Dropout Search for Reliable Uncertainty Prediction on FPGA
by: Zhang, Zehuan, et al.
Published: (2024)
LaMAGIC2: Advanced Circuit Formulations for Language Model-Based Analog Topology Generation
by: Chang, Chen-Chia, et al.
Published: (2025)
Algorithmic Strategies for Sustainable Reuse of Neural Network Accelerators with Permanent Faults
by: Alama, Youssef A. Ait, et al.
Published: (2024)
Scaling Multi-Node Mixture-of-Experts Inference Using Expert Activation Patterns
by: Bambhaniya, Abhimanyu, et al.
Published: (2026)
AI Accelerators for Large Language Model Inference: Architecture Analysis and Scaling Strategies
by: Sharma, Amit
Published: (2025)
L3: DIMM-PIM Integrated Architecture and Coordination for Scalable Long-Context LLM Inference
by: Liu, Qingyuan, et al.
Published: (2025)
PICBench: Benchmarking LLMs for Photonic Integrated Circuits Design
by: Wu, Yuchao, et al.
Published: (2025)
LaMAGIC: Language-Model-based Topology Generation for Analog Integrated Circuits
by: Chang, Chen-Chia, et al.
Published: (2024)
PaCKD: Pattern-Clustered Knowledge Distillation for Compressing Memory Access Prediction Models
by: Gupta, Neelesh, et al.
Published: (2024)
QiMeng: Fully Automated Hardware and Software Design for Processor Chip
by: Zhang, Rui, et al.
Published: (2025)
Dynamic Tsetlin Machine Accelerators for On-Chip Training at the Edge using FPGAs
by: Mao, Gang, et al.
Published: (2025)