:: Library Catalog

Bibliographic Details
Main Authors:	Fan, Zehao; Liu, Zhenyu; Liu, Yunzhen; Hou, Yayue; Benmeziane, Hadjer; Maghraoui, Kaoutar El; Liu, Liu
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Hardware Architecture
Online Access:	https://arxiv.org/abs/2512.04476

Similar Items

AnalogNAS-Bench: A NAS Benchmark for Analog In-Memory Computing
by: Bessalah, Aniss, et al.
Published: (2025)

Make LLM Inference Affordable to Everyone: Augmenting GPU Memory with NDP-DIMM
by: Liu, Lian, et al.
Published: (2025)

Accelerating LLM Inference via Dynamic KV Cache Placement in Heterogeneous Memory System
by: Fang, Yunhua, et al.
Published: (2025)

On the Convergence Theory of Pipeline Gradient-based Analog In-memory Training
by: Wu, Zhaoxian, et al.
Published: (2024)

CXL-GPU: Pushing GPU Memory Boundaries with the Integration of CXL Technologies
by: Gouk, Donghyun, et al.
Published: (2025)

PIM Is All You Need: A CXL-Enabled GPU-Free System for Large Language Model Inference
by: Gu, Yufeng, et al.
Published: (2025)

TriMoE: Augmenting GPU with AMX-Enabled CPU and DIMM-NDP for High-Throughput MoE Inference via Offloading
by: Pan, Yudong, et al.
Published: (2026)

A Novel Extensible Simulation Framework for CXL-Enabled Systems
by: An, Yuda, et al.
Published: (2024)

CXL-Interference: Analysis and Characterization in Modern Computer Systems
by: Mao, Shunyu, et al.
Published: (2024)

SparseST: Exploiting Data Sparsity in Spatiotemporal Modeling and Prediction
by: Wu, Junfeng, et al.
Published: (2025)

SmartQuant: CXL-based AI Model Store in Support of Runtime Configurable Weight Quantization
by: Xie, Rui, et al.
Published: (2024)

Scaling Multi-Node Mixture-of-Experts Inference Using Expert Activation Patterns
by: Bambhaniya, Abhimanyu, et al.
Published: (2026)

Scalable Processing-Near-Memory for 1M-Token LLM Inference: CXL-Enabled KV-Cache Management Beyond GPU Limits
by: Kim, Dowon, et al.
Published: (2025)

TRACE: Unlocking Effective CXL Bandwidth via Lossless Compression and Precision Scaling
by: Xie, Rui, et al.
Published: (2025)

SigmaQuant: Hardware-Aware Heterogeneous Quantization Method for Edge DNN Inference
by: Liu, Qunyou, et al.
Published: (2026)

Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference
by: Skliar, Andrii, et al.
Published: (2024)

Enabling Efficient Transaction Processing on CXL-Based Memory Sharing
by: Wang, Zhao, et al.
Published: (2025)

Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference
by: Hwang, Ranggi, et al.
Published: (2023)

Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models
by: Kim, Jungwoo, et al.
Published: (2026)

CXL Topology-Aware and Expander-Driven Prefetching: Unlocking SSD Performance
by: Oh, Dongsuk, et al.
Published: (2025)

L3: DIMM-PIM Integrated Architecture and Coordination for Scalable Long-Context LLM Inference
by: Liu, Qingyuan, et al.
Published: (2025)

The Case for Persistent CXL switches
by: Hadi, Khan Shaikhul, et al.
Published: (2025)

Sangam: Chiplet-Based DRAM-PIM Accelerator with CXL Integration for LLM Inferencing
by: Kiyawat, Khyati, et al.
Published: (2025)

A Full-System Simulation Framework for CXL-Based SSD Memory System
by: Wang, Yaohui, et al.
Published: (2025)

AxMoE: Characterizing the Impact of Approximate Multipliers on Mixture-of-Experts DNN Architectures
by: Shende, Omkar B, et al.
Published: (2026)

CXL-DMSim: A Full-System CXL Disaggregated Memory Simulator With Comprehensive Silicon Validation
by: Wang, Yanjing, et al.
Published: (2024)

Continuous-Flow Data-Rate-Aware CNN Inference on FPGA
by: Habermann, Tobias, et al.
Published: (2026)

Differentiable Initialization-Accelerated CPU-GPU Hybrid Combinatorial Scheduling
by: Liu, Mingju, et al.
Published: (2026)

Duplex: A Device for Large Language Models with Mixture of Experts, Grouped Query Attention, and Continuous Batching
by: Yun, Sungmin, et al.
Published: (2024)

FPGA-based Emulation and Device-Side Management for CXL-based Memory Tiering Systems
by: Chen, Yiqi, et al.
Published: (2025)

Cosmos: A CXL-Based Full In-Memory System for Approximate Nearest Neighbor Search
by: Ko, Seoyoung, et al.
Published: (2025)

Reimagining Memory Access for LLM Inference: Compression-Aware Memory Controller Design
by: Xie, Rui, et al.
Published: (2025)

SOLE: Hardware-Software Co-design of Softmax and LayerNorm for Efficient Transformer Inference
by: Wang, Wenxun, et al.
Published: (2025)

Architectural and System Implications of CXL-enabled Tiered Memory
by: Yang, Yujie, et al.
Published: (2025)

Hierarchical Mixture of Experts: Generalizable Learning for High-Level Synthesis
by: Li, Weikai, et al.
Published: (2024)

ACE-RTL: When Agentic Context Evolution Meets RTL-Specialized LLMs
by: Deng, Chenhui, et al.
Published: (2026)

Octopus: Enhancing CXL Memory Pods via Sparse Topology
by: Zhong, Yuhong, et al.
Published: (2025)

LMB: Augmenting PCIe Devices with CXL-Linked Memory Buffer
by: Wang, Jiapin, et al.
Published: (2024)

Expert Streaming: Accelerating Low-Batch MoE Inference via Multi-chiplet Architecture and Dynamic Expert Trajectory Scheduling
by: Ma, Songchen, et al.
Published: (2026)

Data-Rate-Aware High-Speed CNN Inference on FPGAs
by: Habermann, Tobias, et al.
Published: (2026)