:: Library Catalog

Bibliographic Details
Main authors:	Chen, Peilin; Yang, Xiaoxuan
Format:	Preprint
Published:	2025
Subjects:	Hardware Architecture
Online access:	https://arxiv.org/abs/2505.17787

Similar items

Optimizing and Exploring System Performance in Compact Processing-in-Memory-based Chips
by: Chen, Peilin, et al.
Published: (2025)

End-to-End Transformer Acceleration Through Processing-in-Memory Architectures
by: Yang, Xiaoxuan, et al.
Published: (2025)

VEDA: Efficient LLM Generation Through Voting-based KV Cache Eviction and Dataflow-flexible Accelerator
by: Wang, Zhican, et al.
Published: (2025)

Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization
by: Kim, Minsu, et al.
Published: (2025)

Area-Efficient In-Memory Computing for Mixture-of-Experts via Multiplexing and Caching
by: Gao, Hanyuan, et al.
Published: (2026)

UniCAIM: A Unified CAM/CIM Architecture with Static-Dynamic KV Cache Pruning for Efficient Long-Context LLM Inference
by: Xu, Weikai, et al.
Published: (2025)

VeriCache: Turning Lossy KV Cache into Lossless LLM Inference
by: Yao, Jiayi, et al.
Published: (2026)

Accelerating LLM Inference via Dynamic KV Cache Placement in Heterogeneous Memory System
by: Fang, Yunhua, et al.
Published: (2025)

Comparative Characterization of KV Cache Management Strategies for LLM Inference
by: Mamo, Oteo, et al.
Published: (2026)

Scalable Processing-Near-Memory for 1M-Token LLM Inference: CXL-Enabled KV-Cache Management Beyond GPU Limits
by: Kim, Dowon, et al.
Published: (2025)

SwiftKV: An Edge-Oriented Attention Algorithm and Multi-Head Accelerator for Fast, Efficient LLM Decoding
by: Zhang, Junming, et al.
Published: (2026)

Kelle: Co-design KV Caching and eDRAM for Efficient LLM Serving in Edge Computing
by: Xia, Tianhua, et al.
Published: (2025)

Adaptive KV Cache Reuse for Fast Long-Context LLM Serving
by: Li, Fei, et al.
Published: (2026)

Adaptive Multi-Objective Tiered Storage Configuration for KV Cache in LLM Service
by: Zheng, Xianzhe, et al.
Published: (2026)

Bit-Flip Vulnerability of Shared KV-Cache Blocks in LLM Serving Systems
by: Yamamoto, Yuji, et al.
Published: (2026)

EVA: Accelerating LLM Decoding via an Efficient Vector Quantization Architecture
by: Duan, Bowen, et al.
Published: (2026)

AutoRAC: Automated Processing-in-Memory Accelerator Design for Recommender Systems
by: Cheng, Feng, et al.
Published: (2025)

Reconfigurable Digital RRAM Logic Enables In-Situ Pruning and Learning for Edge AI
by: Wang, Songqi, et al.
Published: (2025)

FireFly-P: FPGA-Accelerated Spiking Neural Network Plasticity for Robust Adaptive Control
by: Li, Tenglong, et al.
Published: (2026)

MERE: Hardware-Software Co-Design for Masking Cache Miss Latency in Embedded Processors
by: You, Dean, et al.
Published: (2025)

The Avatar Cache: Enabling On-Demand Security with Morphable Cache Architecture
by: Bhatla, Anubhav, et al.
Published: (2026)

Memory Is All You Need: An Overview of Compute-in-Memory Architectures for Accelerating Large Language Model Inference
by: Wolters, Christopher, et al.
Published: (2024)

PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving
by: Yüzügüler, Ahmet Caner, et al.
Published: (2025)

VersaQ-3D: A Reconfigurable Accelerator Enabling Feed-Forward and Generalizable 3D Reconstruction via Versatile Quantization
by: Zhang, Yipu, et al.
Published: (2026)

PiKV: KV Cache Management System for Mixture of Experts
by: Liu, Dong, et al.
Published: (2025)

FireFly-T: High-Throughput Sparsity Exploitation for Spiking Transformer Acceleration with Dual-Engine Overlay Architecture
by: Li, Tenglong, et al.
Published: (2025)

FireFly-S: Exploiting Dual-Side Sparsity for Spiking Neural Networks Acceleration with Reconfigurable Spatial Architecture
by: Li, Tenglong, et al.
Published: (2024)

Arcalis: Accelerating Remote Procedure Calls Using a Lightweight Near-Cache Solution
by: Umeike, Johnson, et al.
Published: (2026)

DEFA: Efficient Deformable Attention Acceleration via Pruning-Assisted Grid-Sampling and Multi-Scale Parallel Processing
by: Xu, Yansong, et al.
Published: (2024)

MASQ: Accelerating Masked Diffusion via Stage-Wise Multi-Precision Quantization
by: Kim, Seeyeon, et al.
Published: (2026)

V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval
by: Kim, Donghyuk, et al.
Published: (2025)

HillInfer: Efficient Long-Context LLM Inference on the Edge with Hierarchical KV Eviction using SmartSSD
by: Sun, He, et al.
Published: (2026)

DCI: A Coordinated Allocation and Filling Workload-Aware Dual-Cache Allocation GNN Inference Acceleration System
by: Luo, Yi, et al.
Published: (2025)

SPARQLe: Sub-Precision Activation Representation for Quantized LLM Inference
by: Parvathy, Aradhana Mohan, et al.
Published: (2026)

FASQ: Flexible Accelerated Subspace Quantization for Calibration-Free LLM Compression
by: Qiao, Ye, et al.
Published: (2026)

Accelerating PoT Quantization on Edge Devices
by: Saha, Rappy, et al.
Published: (2024)

FGMP: Fine-Grained Mixed-Precision Weight and Activation Quantization for Hardware-Accelerated LLM Inference
by: Hooper, Coleman, et al.
Published: (2025)

DiSC: Resolution-Scalable Acceleration of Diffusion Models by Exploiting Sparsity and Cached Token Reuse with Hash-based Distribution
by: Yoon, Jieon, et al.
Published: (2026)

D-Legion: A Scalable Many-Core Architecture for Accelerating Matrix Multiplication in Quantized LLMs
by: Abdelmaksoud, Ahmed J., et al.
Published: (2026)

FusionCIM: Accelerating LLM Inference with Fusion-Driven Computing-in-Memory Architecture
by: Xuan, Zihao, et al.
Published: (2026)