| Main authors: | Liu, Lian; Zhao, Shixin; Zhou, Yutian; He, Yintao; Wang, Mengdi; Han, Yinhe; Wang, Ying |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online access: | https://arxiv.org/abs/2602.11521 |
Similar items
TriMoE: Augmenting GPU with AMX-Enabled CPU and DIMM-NDP for High-Throughput MoE Inference via Offloading
by: Pan, Yudong, et al.
Published: (2026)
Understanding Bottlenecks for Efficiently Serving LLM Inference With KV Offloading
by: Meng, William, et al.
Published: (2025)
Adaptive KV Cache Reuse for Fast Long-Context LLM Serving
by: Li, Fei, et al.
Published: (2026)
Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
by: Qin, Ruoyu, et al.
Published: (2024)
Efficient MoE Serving in the Memory-Bound Regime: Balance Activated Experts, Not Tokens
by: Yu, Yanpeng, et al.
Published: (2025)
PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving
by: Yüzügüler, Ahmet Caner, et al.
Published: (2025)
Adaptive Multi-Objective Tiered Storage Configuration for KV Cache in LLM Service
by: Zheng, Xianzhe, et al.
Published: (2026)
HieraSparse: Hierarchical Semi-Structured Sparse KV Attention
by: Wang, Haoxuan, et al.
Published: (2026)
ZipServ: Fast and Memory-Efficient LLM Inference with Hardware-Aware Lossless Compression
by: Fan, Ruibo, et al.
Published: (2026)
FpgaHub: Fpga-centric Hyper-heterogeneous Computing Platform for Big Data Analytics
by: Wang, Zeke, et al.
Published: (2025)
RevaMp3D: Architecting the Processor Core and Cache Hierarchy for Systems with Monolithically-Integrated Logic and Memory
by: Ghiasi, Nika Mansouri, et al.
Published: (2022)
PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System
by: He, Yintao, et al.
Published: (2025)
GreenLLM: Disaggregating Large Language Model Serving on Heterogeneous GPUs for Lower Carbon Emissions
by: Shi, Tianyao, et al.
Published: (2024)
A Modern Primer on Processing in Memory
by: Mutlu, Onur, et al.
Published: (2020)
PiKV: KV Cache Management System for Mixture of Experts
by: Liu, Dong, et al.
Published: (2025)
Accelerating Triangle Counting with Real Processing-in-Memory Systems
by: Asquini, Lorenzo, et al.
Published: (2025)
Balanced Data Placement for GEMV Acceleration with Processing-In-Memory
by: Ibrahim, Mohamed Assem, et al.
Published: (2024)
Memory-Centric Computing: Recent Advances in Processing-in-DRAM
by: Mutlu, Onur, et al.
Published: (2024)
New Tools, Programming Models, and System Support for Processing-in-Memory Architectures
by: Oliveira, Geraldo F.
Published: (2025)
Efficient Architecture for RISC-V Vector Memory Access
by: Guan, Hongyi, et al.
Published: (2025)
UniFormer: Unified and Efficient Transformer for Reasoning Across General and Custom Computing
by: Ran, Zhuoheng, et al.
Published: (2025)
SpArch: Efficient Architecture for Sparse Matrix Multiplication
by: Zhang, Zhekai, et al.
Published: (2020)
ALPHA-PIM: Analysis of Linear Algebraic Processing for High-Performance Graph Applications on a Real Processing-In-Memory System
by: Barkhordar, Marzieh, et al.
Published: (2026)
SwarmIO: Towards 100 Million IOPS SSD Emulation for Next-generation GPU-centric Storage Systems
by: Kim, Hyeseong, et al.
Published: (2026)
Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
by: Lin, Bin, et al.
Published: (2024)
RAPID-Graph: Recursive All-Pairs Shortest Paths Using Processing-in-Memory for Dynamic Programming on Graphs
by: Chen, Yanru, et al.
Published: (2025)
Survey of Disaggregated Memory: Cross-layer Technique Insights for Next-Generation Datacenters
by: Wang, Jing, et al.
Published: (2025)
Cache Your Prompt When It's Green: Carbon-Aware Caching for Large Language Model Serving
by: Tian, Yuyang, et al.
Published: (2025)
Memory-Centric Computing: Solving Computing's Memory Problem
by: Mutlu, Onur, et al.
Published: (2025)
MIMDRAM: An End-to-End Processing-Using-DRAM System for High-Throughput, Energy-Efficient and Programmer-Transparent Multiple-Instruction Multiple-Data Processing
by: Oliveira, Geraldo F., et al.
Published: (2024)
PIMDAL: Mitigating the Memory Bottleneck in Data Analytics using a Real Processing-in-Memory System
by: Frouzakis, Manos, et al.
Published: (2025)
TeraPool: A Physical Design Aware, 1024 RISC-V Cores Shared-L1-Memory Scaled-up Cluster Design with High Bandwidth Main Memory Link
by: Zhang, Yichao, et al.
Published: (2026)
Analyzing a Two-Tier Disaggregated Memory Protection Scheme Based on Memory Replication
by: Volos, Haris, et al.
Published: (2025)
Exploring the Efficiency of 3D-Stacked AI Chip Architecture for LLM Inference with Voxel
by: Liu, Yiqi, et al.
Published: (2026)
Microbenchmark-Driven Analytical Performance Modeling Across Modern GPU Architectures
by: Jarmusch, Aaron, et al.
Published: (2026)
Investigating Memory Failure Prediction Across CPU Architectures
by: Yu, Qiao, et al.
Published: (2024)
PIUMA: Programmable Integrated Unified Memory Architecture
by: Aananthakrishnan, Sriram, et al.
Published: (2020)
PhD Forum: Efficient Privacy-Preserving Processing via Memory-Centric Computing
by: Mwaisela, Mpoki
Published: (2024)
ARKV: Adaptive and Resource-Efficient KV Cache Management under Limited Memory Budget for Long-Context Inference in LLMs
by: Lei, Jianlong, et al.
Published: (2026)
FengHuang: Next-Generation Memory Orchestration for AI Inferencing
by: Li, Jiamin, et al.
Published: (2025)