| Main authors: | Chen, Peilin; Yang, Xiaoxuan |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online access: | https://arxiv.org/abs/2505.17787 |
Similar items
Optimizing and Exploring System Performance in Compact Processing-in-Memory-based Chips
by: Chen, Peilin, et al.
Published: (2025)
End-to-End Transformer Acceleration Through Processing-in-Memory Architectures
by: Yang, Xiaoxuan, et al.
Published: (2025)
VEDA: Efficient LLM Generation Through Voting-based KV Cache Eviction and Dataflow-flexible Accelerator
by: Wang, Zhican, et al.
Published: (2025)
Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization
by: Kim, Minsu, et al.
Published: (2025)
Area-Efficient In-Memory Computing for Mixture-of-Experts via Multiplexing and Caching
by: Gao, Hanyuan, et al.
Published: (2026)
UniCAIM: A Unified CAM/CIM Architecture with Static-Dynamic KV Cache Pruning for Efficient Long-Context LLM Inference
by: Xu, Weikai, et al.
Published: (2025)
VeriCache: Turning Lossy KV Cache into Lossless LLM Inference
by: Yao, Jiayi, et al.
Published: (2026)
Accelerating LLM Inference via Dynamic KV Cache Placement in Heterogeneous Memory System
by: Fang, Yunhua, et al.
Published: (2025)
Comparative Characterization of KV Cache Management Strategies for LLM Inference
by: Mamo, Oteo, et al.
Published: (2026)
Scalable Processing-Near-Memory for 1M-Token LLM Inference: CXL-Enabled KV-Cache Management Beyond GPU Limits
by: Kim, Dowon, et al.
Published: (2025)
SwiftKV: An Edge-Oriented Attention Algorithm and Multi-Head Accelerator for Fast, Efficient LLM Decoding
by: Zhang, Junming, et al.
Published: (2026)
Kelle: Co-design KV Caching and eDRAM for Efficient LLM Serving in Edge Computing
by: Xia, Tianhua, et al.
Published: (2025)
Adaptive KV Cache Reuse for Fast Long-Context LLM Serving
by: Li, Fei, et al.
Published: (2026)
Adaptive Multi-Objective Tiered Storage Configuration for KV Cache in LLM Service
by: Zheng, Xianzhe, et al.
Published: (2026)
Bit-Flip Vulnerability of Shared KV-Cache Blocks in LLM Serving Systems
by: Yamamoto, Yuji, et al.
Published: (2026)
EVA: Accelerating LLM Decoding via an Efficient Vector Quantization Architecture
by: Duan, Bowen, et al.
Published: (2026)
AutoRAC: Automated Processing-in-Memory Accelerator Design for Recommender Systems
by: Cheng, Feng, et al.
Published: (2025)
Reconfigurable Digital RRAM Logic Enables In-Situ Pruning and Learning for Edge AI
by: Wang, Songqi, et al.
Published: (2025)
FireFly-P: FPGA-Accelerated Spiking Neural Network Plasticity for Robust Adaptive Control
by: Li, Tenglong, et al.
Published: (2026)
MERE: Hardware-Software Co-Design for Masking Cache Miss Latency in Embedded Processors
by: You, Dean, et al.
Published: (2025)
The Avatar Cache: Enabling On-Demand Security with Morphable Cache Architecture
by: Bhatla, Anubhav, et al.
Published: (2026)
Memory Is All You Need: An Overview of Compute-in-Memory Architectures for Accelerating Large Language Model Inference
by: Wolters, Christopher, et al.
Published: (2024)
PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving
by: Yüzügüler, Ahmet Caner, et al.
Published: (2025)
VersaQ-3D: A Reconfigurable Accelerator Enabling Feed-Forward and Generalizable 3D Reconstruction via Versatile Quantization
by: Zhang, Yipu, et al.
Published: (2026)
PiKV: KV Cache Management System for Mixture of Experts
by: Liu, Dong, et al.
Published: (2025)
FireFly-T: High-Throughput Sparsity Exploitation for Spiking Transformer Acceleration with Dual-Engine Overlay Architecture
by: Li, Tenglong, et al.
Published: (2025)
FireFly-S: Exploiting Dual-Side Sparsity for Spiking Neural Networks Acceleration with Reconfigurable Spatial Architecture
by: Li, Tenglong, et al.
Published: (2024)
Arcalis: Accelerating Remote Procedure Calls Using a Lightweight Near-Cache Solution
by: Umeike, Johnson, et al.
Published: (2026)
DEFA: Efficient Deformable Attention Acceleration via Pruning-Assisted Grid-Sampling and Multi-Scale Parallel Processing
by: Xu, Yansong, et al.
Published: (2024)
MASQ: Accelerating Masked Diffusion via Stage-Wise Multi-Precision Quantization
by: Kim, Seeyeon, et al.
Published: (2026)
V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval
by: Kim, Donghyuk, et al.
Published: (2025)
HillInfer: Efficient Long-Context LLM Inference on the Edge with Hierarchical KV Eviction using SmartSSD
by: Sun, He, et al.
Published: (2026)
DCI: A Coordinated Allocation and Filling Workload-Aware Dual-Cache Allocation GNN Inference Acceleration System
by: Luo, Yi, et al.
Published: (2025)
SPARQLe: Sub-Precision Activation Representation for Quantized LLM Inference
by: Parvathy, Aradhana Mohan, et al.
Published: (2026)
FASQ: Flexible Accelerated Subspace Quantization for Calibration-Free LLM Compression
by: Qiao, Ye, et al.
Published: (2026)
Accelerating PoT Quantization on Edge Devices
by: Saha, Rappy, et al.
Published: (2024)
FGMP: Fine-Grained Mixed-Precision Weight and Activation Quantization for Hardware-Accelerated LLM Inference
by: Hooper, Coleman, et al.
Published: (2025)
DiSC: Resolution-Scalable Acceleration of Diffusion Models by Exploiting Sparsity and Cached Token Reuse with Hash-based Distribution
by: Yoon, Jieon, et al.
Published: (2026)
D-Legion: A Scalable Many-Core Architecture for Accelerating Matrix Multiplication in Quantized LLMs
by: Abdelmaksoud, Ahmed J., et al.
Published: (2026)
FusionCIM: Accelerating LLM Inference with Fusion-Driven Computing-in-Memory Architecture
by: Xuan, Zihao, et al.
Published: (2026)