| Main Authors: | Ren, Jie, Ma, Bin, Yang, Shuangyan, Francis, Benjamin, Ardestani, Ehsan K., Si, Min, Li, Dong |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2511.08568 |
Similar Items
Tuning Fast Memory Size based on Modeling of Page Migration for Tiered Memory
by: Chen, Shangye, et al.
Published: (2024)
Dissecting Embedding Bag Performance in DLRM Inference
by: Ambati, Chandrish, et al.
Published: (2025)
Performance Characterization of AutoNUMA Memory Tiering on Graph Analytics
by: Moura, Diego, et al.
Published: (2022)
A Limits Study of Memory-side Tiering Telemetry
by: Petrucci, Vinicius, et al.
Published: (2025)
Exploring and Evaluating Real-world CXL: Use Cases and System Adoption
by: Wang, Xi, et al.
Published: (2024)
Hybrid Adaptive Tuning for Tiered Memory Systems
by: Wang, Xi, et al.
Published: (2026)
Modeling Utilization to Identify Shared-Memory Atomic Bottlenecks
by: Dong, Rongcui, et al.
Published: (2025)
Predictive Multi-Tier Memory Management for KV Cache in Large-Scale GPU Inference
by: Ganjihal, Sanjeev Rao
Published: (2026)
Opt4GPTQ: Co-Optimizing Memory and Computation for 4-bit GPTQ Quantized LLM Inference on Heterogeneous Platforms
by: Zhang, Yaozheng, et al.
Published: (2025)
AutoChunk: Automated Activation Chunk for Memory-Efficient Long Sequence Inference
by: Zhao, Xuanlei, et al.
Published: (2024)
Optimizing System Memory Bandwidth with Micron CXL Memory Expansion Modules on Intel Xeon 6 Processors
by: Sehgal, Rohit, et al.
Published: (2024)
Energy-Efficient Transformer Inference: Optimization Strategies for Time Series Classification
by: Kermani, Arshia, et al.
Published: (2025)
Heterogeneous Memory Pool Tuning
by: Vaverka, Filip, et al.
Published: (2025)
FlashSVD: Memory-Efficient Inference with Streaming for Low-Rank Models
by: Shao, Zishan, et al.
Published: (2025)
Accelerating LLM Inference via Dynamic KV Cache Placement in Heterogeneous Memory System
by: Fang, Yunhua, et al.
Published: (2025)
Examem: Low-Overhead Memory Instrumentation for Intelligent Memory Systems
by: Poduval, Ashwin, et al.
Published: (2024)
ZO2: Scalable Zeroth-Order Fine-Tuning for Extremely Large Language Models with Limited GPU Memory
by: Wang, Liangyu, et al.
Published: (2025)
Glinthawk: A Two-Tiered Architecture for Offline LLM Inference
by: Hamadanian, Pouya, et al.
Published: (2025)
Collaborative Processing for Multi-Tenant Inference on Memory-Constrained Edge TPUs
by: Ng, Nathan, et al.
Published: (2026)
Updates on the Low-Level Abstraction of Memory Access
by: Gruber, Bernhard Manfred
Published: (2023)
FB$^+$-tree: A Memory-Optimized B$^+$-tree with Latch-Free Update
by: Chen, Yuan, et al.
Published: (2025)
DCC: Data-Centric Compilation of Machine Learning Kernels for Processing-In-Memory Architectures
by: Yang, Peiming, et al.
Published: (2025)
Analysis and Evaluation of Using Microsecond-Latency Memory for In-Memory Indices and Caches in SSD-Based Key-Value Stores
by: Bando, Yosuke, et al.
Published: (2025)
Heterogeneous Memory Benchmarking Toolkit
by: Ghaemi, Golsana, et al.
Published: (2025)
Virtual-Memory Powersort
by: Moltmann, Finn, et al.
Published: (2026)
EDAN: Towards Understanding Memory Parallelism and Latency Sensitivity in HPC
by: Shen, Siyuan, et al.
Published: (2025)
Beating vDSP: A 138 GFLOPS Radix-8 Stockham FFT on Apple Silicon via Two-Tier Register-Threadgroup Memory Decomposition
by: Bergach, Mohamed Amine
Published: (2026)
CoServe: Efficient Collaboration-of-Experts (CoE) Model Inference with Limited Memory
by: Suo, Jiashun, et al.
Published: (2025)
HD-MoE: Hybrid and Dynamic Parallelism for Mixture-of-Expert LLMs with 3D Near-Memory Processing
by: Huang, Haochen, et al.
Published: (2025)
WritePolicyBench: Benchmarking Memory Write Policies under Byte Budgets
by: Cham, Edgard El
Published: (2026)
A Controlled Study of Memory Hierarchy Transitions in Quantum Circuit Simulation on Apple M4 Pro Unified Memory Architecture
by: Pratipat, Gyan
Published: (2026)
Anatomizing Deep Learning Inference in Web Browsers
by: Wang, Qipeng, et al.
Published: (2024)
From Profiling to Optimization: Unveiling the Profile Guided Optimization
by: Liu, Bingxin, et al.
Published: (2025)
Putting the Context back into Memory
by: Roberts, David A.
Published: (2025)
Memory Analysis on the Training Course of DeepSeek Models
by: Zhang, Ping, et al.
Published: (2025)
ExpertFlow: Adaptive Expert Scheduling and Memory Coordination for Efficient MoE Inference
by: Shen, Zixu, et al.
Published: (2025)
DGAP: Efficient Dynamic Graph Analysis on Persistent Memory
by: Islam, Abdullah Al Raqibul, et al.
Published: (2024)
ZipServ: Fast and Memory-Efficient LLM Inference with Hardware-Aware Lossless Compression
by: Fan, Ruibo, et al.
Published: (2026)
CUTHERMO: Understanding GPU Memory Inefficiencies with Heat Map Profiling
by: Zhao, Yanbo, et al.
Published: (2025)
ChatNeuroSim: An LLM Agent Framework for Automated Compute-in-Memory Accelerator Deployment and Optimization
by: Lee, Ming-Yen, et al.
Published: (2026)