:: Library Catalog

Bibliographic Details
Main Authors:	Ma, Haiyue; Du, Zhixu; Chen, Yiran
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Hardware Architecture
Online Access:	https://arxiv.org/abs/2506.07366

Similar Items

FlashMoE: Fast Distributed MoE in a Single Kernel
by: Aimuyo, Osayamen Jonathan, et al.
Published: (2025)

Efficient MoE Serving in the Memory-Bound Regime: Balance Activated Experts, Not Tokens
by: Yu, Yanpeng, et al.
Published: (2025)

Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference
by: Hwang, Ranggi, et al.
Published: (2023)

SliceMoE: Bit-Sliced Expert Caching under Miss-Rate Constraints for Efficient MoE Inference
by: Choi, Yuseon, et al.
Published: (2025)

Stratum: System-Hardware Co-Design with Tiered Monolithic 3D-Stackable DRAM for Efficient MoE Serving
by: Pan, Yue, et al.
Published: (2025)

Expert Streaming: Accelerating Low-Batch MoE Inference via Multi-chiplet Architecture and Dynamic Expert Trajectory Scheduling
by: Ma, Songchen, et al.
Published: (2026)

Accelerating MoE with Dynamic In-Switch Computing on Multi-GPUs
by: Zhang, Qijun, et al.
Published: (2026)

MoE-Hub: Taming Software Complexity for Seamless MoE Overlap with Hardware-Accelerated Communication on Multi-GPU Systems
by: Zhou, Zhuoshan, et al.
Published: (2026)

AxMoE: Characterizing the Impact of Approximate Multipliers on Mixture-of-Experts DNN Architectures
by: Shende, Omkar B, et al.
Published: (2026)

A3D-MoE: Acceleration of Large Language Models with Mixture of Experts via 3D Heterogeneous Integration
by: Huang, Wei-Hsing, et al.
Published: (2025)

Patterns behind Chaos: Forecasting Data Movement for Efficient Large-Scale MoE LLM Inference
by: Yu, Zhongkai, et al.
Published: (2025)

Mozart: Modularized and Efficient MoE Training on 3.5D Wafer-Scale Chiplet Architectures
by: Luo, Shuqing, et al.
Published: (2026)

ELMoE-3D: Leveraging Intrinsic Elasticity of MoE for Hybrid-Bonding-Enabled Self-Speculative Decoding in On-Premises Serving
by: Choi, Yuseon, et al.
Published: (2026)

RouteScan: A Non-Intrusive Approach to Auditing MoE LLMs Safety via Expert Routing Telemetry
by: Lv, Bo, et al.
Published: (2026)

MoNDE: Mixture of Near-Data Experts for Large-Scale Sparse Models
by: Kim, Taehyun, et al.
Published: (2024)

TriMoE: Augmenting GPU with AMX-Enabled CPU and DIMM-NDP for High-Throughput MoE Inference via Offloading
by: Pan, Yudong, et al.
Published: (2026)

BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration
by: Chen, Yuzong, et al.
Published: (2024)

On the Shape of Latent Variables in a Denoising VAE-MoG: A Posterior Sampling-Based Study
by: Bascuñán, Fernanda Zapata
Published: (2025)

Accelerating Frontier MoE Training with 3D Integrated Optics
by: Bernadskiy, Mikhail, et al.
Published: (2025)

End-to-End Transformer Acceleration Through Processing-in-Memory Architectures
by: Yang, Xiaoxuan, et al.
Published: (2025)

UbiMoE: A Ubiquitous Mixture-of-Experts Vision Transformer Accelerator With Hybrid Computation Pattern on FPGA
by: Dong, Jiale, et al.
Published: (2025)

EVA: Accelerating LLM Decoding via an Efficient Vector Quantization Architecture
by: Duan, Bowen, et al.
Published: (2026)

Context-Aware Mixture-of-Experts Inference on CXL-Enabled GPU-NDP Systems
by: Fan, Zehao, et al.
Published: (2025)

Graph Neural Networks Based Analog Circuit Link Prediction
by: Pan, Guanyuan, et al.
Published: (2025)

Unsupervised Graph Neural Network Framework for Balanced Multipatterning in Advanced Electronic Design Automation Layouts
by: Helaly, Abdelrahman, et al.
Published: (2025)

Duplex: A Device for Large Language Models with Mixture of Experts, Grouped Query Attention, and Continuous Batching
by: Yun, Sungmin, et al.
Published: (2024)

Hardware-Aware Data and Instruction Mapping for AI Tasks: Balancing Parallelism, I/O and Memory Tradeoffs
by: Chowdhury, Md Rownak Hossain, et al.
Published: (2025)

MonoSparse-CAM: Efficient Tree Model Processing via Monotonicity and Sparsity in CAMs
by: Molom-Ochir, Tergel, et al.
Published: (2024)

CAMformer: Associative Memory is All You Need
by: Molom-Ochir, Tergel, et al.
Published: (2025)

Hardware-Aware Neural Dropout Search for Reliable Uncertainty Prediction on FPGA
by: Zhang, Zehuan, et al.
Published: (2024)

LaMAGIC2: Advanced Circuit Formulations for Language Model-Based Analog Topology Generation
by: Chang, Chen-Chia, et al.
Published: (2025)

Algorithmic Strategies for Sustainable Reuse of Neural Network Accelerators with Permanent Faults
by: Alama, Youssef A. Ait, et al.
Published: (2024)

Scaling Multi-Node Mixture-of-Experts Inference Using Expert Activation Patterns
by: Bambhaniya, Abhimanyu, et al.
Published: (2026)

AI Accelerators for Large Language Model Inference: Architecture Analysis and Scaling Strategies
by: Sharma, Amit
Published: (2025)

L3: DIMM-PIM Integrated Architecture and Coordination for Scalable Long-Context LLM Inference
by: Liu, Qingyuan, et al.
Published: (2025)

PICBench: Benchmarking LLMs for Photonic Integrated Circuits Design
by: Wu, Yuchao, et al.
Published: (2025)

LaMAGIC: Language-Model-based Topology Generation for Analog Integrated Circuits
by: Chang, Chen-Chia, et al.
Published: (2024)

PaCKD: Pattern-Clustered Knowledge Distillation for Compressing Memory Access Prediction Models
by: Gupta, Neelesh, et al.
Published: (2024)

QiMeng: Fully Automated Hardware and Software Design for Processor Chip
by: Zhang, Rui, et al.
Published: (2025)

Dynamic Tsetlin Machine Accelerators for On-Chip Training at the Edge using FPGAs
by: Mao, Gang, et al.
Published: (2025)