| Main Authors: | Fan, Zehao; Liu, Zhenyu; Liu, Yunzhen; Hou, Yayue; Benmeziane, Hadjer; Maghraoui, Kaoutar El; Liu, Liu |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2512.04476 |
Similar Items
AnalogNAS-Bench: A NAS Benchmark for Analog In-Memory Computing
by: Bessalah, Aniss, et al.
Published: (2025)
Make LLM Inference Affordable to Everyone: Augmenting GPU Memory with NDP-DIMM
by: Liu, Lian, et al.
Published: (2025)
Accelerating LLM Inference via Dynamic KV Cache Placement in Heterogeneous Memory System
by: Fang, Yunhua, et al.
Published: (2025)
On the Convergence Theory of Pipeline Gradient-based Analog In-memory Training
by: Wu, Zhaoxian, et al.
Published: (2024)
CXL-GPU: Pushing GPU Memory Boundaries with the Integration of CXL Technologies
by: Gouk, Donghyun, et al.
Published: (2025)
PIM Is All You Need: A CXL-Enabled GPU-Free System for Large Language Model Inference
by: Gu, Yufeng, et al.
Published: (2025)
TriMoE: Augmenting GPU with AMX-Enabled CPU and DIMM-NDP for High-Throughput MoE Inference via Offloading
by: Pan, Yudong, et al.
Published: (2026)
A Novel Extensible Simulation Framework for CXL-Enabled Systems
by: An, Yuda, et al.
Published: (2024)
CXL-Interference: Analysis and Characterization in Modern Computer Systems
by: Mao, Shunyu, et al.
Published: (2024)
SparseST: Exploiting Data Sparsity in Spatiotemporal Modeling and Prediction
by: Wu, Junfeng, et al.
Published: (2025)
SmartQuant: CXL-based AI Model Store in Support of Runtime Configurable Weight Quantization
by: Xie, Rui, et al.
Published: (2024)
Scaling Multi-Node Mixture-of-Experts Inference Using Expert Activation Patterns
by: Bambhaniya, Abhimanyu, et al.
Published: (2026)
Scalable Processing-Near-Memory for 1M-Token LLM Inference: CXL-Enabled KV-Cache Management Beyond GPU Limits
by: Kim, Dowon, et al.
Published: (2025)
TRACE: Unlocking Effective CXL Bandwidth via Lossless Compression and Precision Scaling
by: Xie, Rui, et al.
Published: (2025)
SigmaQuant: Hardware-Aware Heterogeneous Quantization Method for Edge DNN Inference
by: Liu, Qunyou, et al.
Published: (2026)
Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference
by: Skliar, Andrii, et al.
Published: (2024)
Enabling Efficient Transaction Processing on CXL-Based Memory Sharing
by: Wang, Zhao, et al.
Published: (2025)
Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference
by: Hwang, Ranggi, et al.
Published: (2023)
Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models
by: Kim, Jungwoo, et al.
Published: (2026)
CXL Topology-Aware and Expander-Driven Prefetching: Unlocking SSD Performance
by: Oh, Dongsuk, et al.
Published: (2025)
L3: DIMM-PIM Integrated Architecture and Coordination for Scalable Long-Context LLM Inference
by: Liu, Qingyuan, et al.
Published: (2025)
The Case for Persistent CXL switches
by: Hadi, Khan Shaikhul, et al.
Published: (2025)
Sangam: Chiplet-Based DRAM-PIM Accelerator with CXL Integration for LLM Inferencing
by: Kiyawat, Khyati, et al.
Published: (2025)
A Full-System Simulation Framework for CXL-Based SSD Memory System
by: Wang, Yaohui, et al.
Published: (2025)
AxMoE: Characterizing the Impact of Approximate Multipliers on Mixture-of-Experts DNN Architectures
by: Shende, Omkar B, et al.
Published: (2026)
CXL-DMSim: A Full-System CXL Disaggregated Memory Simulator With Comprehensive Silicon Validation
by: Wang, Yanjing, et al.
Published: (2024)
Continuous-Flow Data-Rate-Aware CNN Inference on FPGA
by: Habermann, Tobias, et al.
Published: (2026)
Differentiable Initialization-Accelerated CPU-GPU Hybrid Combinatorial Scheduling
by: Liu, Mingju, et al.
Published: (2026)
Duplex: A Device for Large Language Models with Mixture of Experts, Grouped Query Attention, and Continuous Batching
by: Yun, Sungmin, et al.
Published: (2024)
FPGA-based Emulation and Device-Side Management for CXL-based Memory Tiering Systems
by: Chen, Yiqi, et al.
Published: (2025)
Cosmos: A CXL-Based Full In-Memory System for Approximate Nearest Neighbor Search
by: Ko, Seoyoung, et al.
Published: (2025)
Reimagining Memory Access for LLM Inference: Compression-Aware Memory Controller Design
by: Xie, Rui, et al.
Published: (2025)
SOLE: Hardware-Software Co-design of Softmax and LayerNorm for Efficient Transformer Inference
by: Wang, Wenxun, et al.
Published: (2025)
Architectural and System Implications of CXL-enabled Tiered Memory
by: Yang, Yujie, et al.
Published: (2025)
Hierarchical Mixture of Experts: Generalizable Learning for High-Level Synthesis
by: Li, Weikai, et al.
Published: (2024)
ACE-RTL: When Agentic Context Evolution Meets RTL-Specialized LLMs
by: Deng, Chenhui, et al.
Published: (2026)
Octopus: Enhancing CXL Memory Pods via Sparse Topology
by: Zhong, Yuhong, et al.
Published: (2025)
LMB: Augmenting PCIe Devices with CXL-Linked Memory Buffer
by: Wang, Jiapin, et al.
Published: (2024)
Expert Streaming: Accelerating Low-Batch MoE Inference via Multi-chiplet Architecture and Dynamic Expert Trajectory Scheduling
by: Ma, Songchen, et al.
Published: (2026)
Data-Rate-Aware High-Speed CNN Inference on FPGAs
by: Habermann, Tobias, et al.
Published: (2026)