Library Catalog

Bibliographic Details
Main Authors:	Luo, Yi; Wang, Yaobin; Wang, Qi; Song, Yingchen; Wu, Huan; Wang, Qingfeng; Huang, Jun
Format:	Preprint
Published:	2025
Subjects:	Hardware Architecture
Online Access:	https://arxiv.org/abs/2503.01281

Similar Items

Towards Performance-Aware Allocation for Accelerated Machine Learning on GPU-SSD Systems
by: Gundawar, Ayush, et al.
Published: (2024)

Adaptive Cache Pollution Control for Large Language Model Inference Workloads Using Temporal CNN-Based Prediction and Priority-Aware Replacement
by: Liu, Songze, et al.
Published: (2025)

GreenMalloc: Allocator Optimisation for Industrial Workloads
by: Dakhama, Aidan, et al.
Published: (2025)

Mapping Space Exploration for Multi-Chiplet Accelerators Targeting LLM Inference Serving Workloads
by: Li, Boyu, et al.
Published: (2025)

Allspark: Workload Orchestration for Visual Transformers on Processing In-Memory Systems
by: Ge, Mengke, et al.
Published: (2024)

ODMA: On-Demand Memory Allocation Strategy for LLM Serving on LPDDR-Class Accelerators
by: Zou, Guoqiang, et al.
Published: (2025)

Accelerating GNN Training through Locality-aware Dropout and Merge
by: Sun, Gongjian, et al.
Published: (2025)

Accelerating LLM Inference via Dynamic KV Cache Placement in Heterogeneous Memory System
by: Fang, Yunhua, et al.
Published: (2025)

Understanding Inference-Time Token Allocation and Coverage Limits in Agentic Hardware Verification
by: Patel, Vihaan, et al.
Published: (2026)

SpeedLLM: An FPGA Co-design of Large Language Model Inference Accelerator
by: Wang, Peipei, et al.
Published: (2025)

High Utilization Energy-Aware Real-Time Inference Deep Convolutional Neural Network Accelerator
by: Lin, Kuan-Ting, et al.
Published: (2025)

NOVA: Coordinated Test Selection and Bayes-Optimized Constrained Randomization for Accelerated Coverage Closure
by: Peng, Weijie, et al.
Published: (2025)

Instruction-Based Coordination of Heterogeneous Processing Units for Acceleration of DNN Inference
by: Petropoulos, Anastasios, et al.
Published: (2025)

HPIM: Heterogeneous Processing-In-Memory-based Accelerator for Large Language Models Inference
by: Duan, Cenlin, et al.
Published: (2025)

VEDA: Efficient LLM Generation Through Voting-based KV Cache Eviction and Dataflow-flexible Accelerator
by: Wang, Zhican, et al.
Published: (2025)

ApproxPilot: A GNN-based Accelerator Approximation Framework
by: Zhang, Qing, et al.
Published: (2024)

Communication Characterization of AI Workloads for Large-scale Multi-chiplet Accelerators
by: Musavi, Mariam, et al.
Published: (2024)

Garibaldi: A Pairwise Instruction-Data Management for Enhancing Shared Last-Level Cache Performance in Server Workloads
by: Kwon, Jaewon, et al.
Published: (2025)

Messaging-based Adaptive Vector Computing (MAVeC) Accelerator for AI Workloads
by: Chowdhury, Md. Rownak Hossain, et al.
Published: (2024)

A Dynamic Allocation Scheme for Adaptive Shared-Memory Mapping on Kilo-core RV Clusters for Attention-Based Model Deployment
by: Wang, Bowen, et al.
Published: (2025)

Aging Aware Adaptive Voltage Scaling for Reliable and Efficient AI Accelerators
by: Xie, Tong, et al.
Published: (2026)

VeriCache: Turning Lossy KV Cache into Lossless LLM Inference
by: Yao, Jiayi, et al.
Published: (2026)

Integrating Prefetcher Selection with Dynamic Request Allocation Improves Prefetching Efficiency
by: Li, Mengming, et al.
Published: (2025)

Be CIM or Be Memory: A Dual-mode-aware DNN Compiler for CIM Accelerators
by: Zhao, Shixin, et al.
Published: (2025)

PREFENDER: A Prefetching Defender against Cache Side Channel Attacks as A Pretender
by: Li, Luyi, et al.
Published: (2023)

Multi-Objective Hardware-Mapping Co-Optimisation for Multi-DNN Workloads on Chiplet-based Accelerators
by: Das, Abhijit, et al.
Published: (2022)

SOFA: A Compute-Memory Optimized Sparsity Accelerator via Cross-Stage Coordinated Tiling
by: Wang, Huizheng, et al.
Published: (2024)

SpecMamba: Accelerating Mamba Inference on FPGA with Speculative Decoding
by: Zhong, Linfeng, et al.
Published: (2025)

PRIMAL: Processing-In-Memory Based Low-Rank Adaptation for LLM Inference Accelerator
by: Chong, Yue Jiet, et al.
Published: (2026)

Titanus: Enabling KV Cache Pruning and Quantization On-the-Fly for LLM Acceleration
by: Chen, Peilin, et al.
Published: (2025)

Comparative Characterization of KV Cache Management Strategies for LLM Inference
by: Mamo, Oteo, et al.
Published: (2026)

BackCache: Mitigating Contention-Based Cache Timing Attacks by Hiding Cache Line Evictions
by: Wang, Quancheng, et al.
Published: (2023)

Workload-Aware Hardware Accelerator Mining for Distributed Deep Learning Training
by: Adnan, Muhammad, et al.
Published: (2024)

HCiM: ADC-Less Hybrid Analog-Digital Compute in Memory Accelerator for Deep Learning Workloads
by: Negi, Shubham, et al.
Published: (2024)

SliceMoE: Bit-Sliced Expert Caching under Miss-Rate Constraints for Efficient MoE Inference
by: Choi, Yuseon, et al.
Published: (2025)

MX-SAFE: Versatile Inference- and Training-Proof Microscaling Format with On-the-Fly Exponent and Mantissa Bit Allocation
by: Park, Dahoon, et al.
Published: (2026)

TENET: An Efficient Sparsity-Aware LUT-Centric Architecture for Ternary LLM Inference On Edge
by: Huang, Zhirui, et al.
Published: (2025)

Low Latency GNN Accelerator for Quantum Error Correction
by: Cicero, Alessio, et al.
Published: (2026)

MCBP: A Memory-Compute Efficient LLM Inference Accelerator Leveraging Bit-Slice-enabled Sparsity and Repetitiveness
by: Wang, Huizheng, et al.
Published: (2025)

UniCAIM: A Unified CAM/CIM Architecture with Static-Dynamic KV Cache Pruning for Efficient Long-Context LLM Inference
by: Xu, Weikai, et al.
Published: (2025)