:: Library Catalog

Bibliographic Details
Main Authors:	Pan, Xiurui; Li, Endian; Li, Qiao; Liang, Shengwen; Shan, Yizhou; Zhou, Ke; Luo, Yingwei; Wang, Xiaolin; Zhang, Jie
Format:	Preprint
Published:	2024
Subjects:	Hardware Architecture; Computation and Language
Online Access:	https://arxiv.org/abs/2409.04992

Similar Items

A Cost-Effective Near-Storage Processing Solution for Offline Inference of Long-Context LLMs
by: Jang, Hongsun, et al.
Published: (2025)

HillInfer: Efficient Long-Context LLM Inference on the Edge with Hierarchical KV Eviction using SmartSSD
by: Sun, He, et al.
Published: (2026)

GPU Acceleration of TFHE-Based High-Precision Nonlinear Layers for Encrypted LLM Inference
by: Chen, Guoci, et al.
Published: (2026)

LLMulator: Generalizable Cost Modeling for Dataflow Accelerators with Input-Adaptive Control Flow
by: Chang, Kaiyan, et al.
Published: (2025)

A Novel Extensible Simulation Framework for CXL-Enabled Systems
by: An, Yuda, et al.
Published: (2024)

FAST-Prefill: FPGA Accelerated Sparse Attention for Long Context LLM Prefill
by: Jayanth, Rakshith, et al.
Published: (2026)

Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference
by: Wu, Haoran, et al.
Published: (2025)

An RDMA-First Object Storage System with SmartNIC Offload
by: Zhu, Yu, et al.
Published: (2025)

UniCAIM: A Unified CAM/CIM Architecture with Static-Dynamic KV Cache Pruning for Efficient Long-Context LLM Inference
by: Xu, Weikai, et al.
Published: (2025)

Cambricon-LLM: A Chiplet-Based Hybrid Architecture for On-Device Inference of 70B LLM
by: Yu, Zhongkai, et al.
Published: (2024)

Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
by: Lin, Bin, et al.
Published: (2024)

Knowledge-Guided Attention-Inspired Learning for Task Offloading in Vehicle Edge Computing
by: Ma, Ke, et al.
Published: (2025)

Understanding Bottlenecks for Efficiently Serving LLM Inference With KV Offloading
by: Meng, William, et al.
Published: (2025)

SSD Offloading for LLM Mixture-of-Experts Weights Considered Harmful in Energy Efficiency
by: Kyung, Kwanhee, et al.
Published: (2025)

Graphitron: A Domain Specific Language for FPGA-based Graph Processing Accelerator Generation
by: Zhang, Xinmiao, et al.
Published: (2024)

PIMphony: Overcoming Bandwidth and Capacity Inefficiency in PIM-based Long-Context LLM Inference System
by: Kwon, Hyucksung, et al.
Published: (2024)

L3: DIMM-PIM Integrated Architecture and Coordination for Scalable Long-Context LLM Inference
by: Liu, Qingyuan, et al.
Published: (2025)

Lifecycle Cost-Effectiveness Modeling for Redundancy-Enhanced Multi-Chiplet Architectures
by: Liu, Zizhen, et al.
Published: (2026)

PD-Swap: Prefill-Decode Logic Swapping for End-to-End LLM Inference on Edge FPGAs via Dynamic Partial Reconfiguration
by: Zhang, Yifan, et al.
Published: (2025)

NVLLM: A 3D NAND-Centric Architecture Enabling Edge on-Device LLM Inference
by: Hao, Mingbo, et al.
Published: (2026)

A Systematic Characterization of LLM Inference on GPUs
by: Wang, Haonan, et al.
Published: (2025)

Convolutions Predictable Offloading to an Accelerator: Formalization and Optimization
by: Husson, Benjamin, et al.
Published: (2026)

TerEffic: Highly Efficient Ternary LLM Inference on FPGA
by: Yin, Chenyang, et al.
Published: (2025)

Bandwidth-Effective DRAM Cache for GPUs with Storage-Class Memory
by: Hong, Jeongmin, et al.
Published: (2024)

AccLLM: Accelerating Long-Context LLM Inference Via Algorithm-Hardware Co-Design
by: Liang, Yanbiao, et al.
Published: (2025)

Mapping Space Exploration for Multi-Chiplet Accelerators Targeting LLM Inference Serving Workloads
by: Li, Boyu, et al.
Published: (2025)

Resilient and Secure Programmable System-on-Chip Accelerator Offload
by: Gouveia, Inês Pinto, et al.
Published: (2024)

OffRAC: Offloading Through Remote Accelerator Calls
by: Yang, Ziyi, et al.
Published: (2025)

Harmonia: Algorithm-Hardware Co-Design for Memory- and Compute-Efficient BFP-based LLM Inference
by: Wang, Xinyu, et al.
Published: (2026)

FusionCIM: Accelerating LLM Inference with Fusion-Driven Computing-in-Memory Architecture
by: Xuan, Zihao, et al.
Published: (2026)

PermuteV: A Performant Side-channel-Resistant RISC-V Core Securing Edge AI Inference
by: Narkthong, Nuntipat, et al.
Published: (2025)

A Switch-Centric In-Network Architecture for Accelerating LLM Inference in Shared-Memory Network
by: Jiang, Aojie, et al.
Published: (2026)

CHIME: Chiplet-based Heterogeneous Near-Memory Acceleration for Edge Multimodal LLM Inference
by: Chen, Yanru, et al.
Published: (2025)

Rethinking LLM Inference Bottlenecks: Insights from Latent Attention and Mixture-of-Experts
by: Yun, Sungmin, et al.
Published: (2025)

Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding
by: Fan, Wang, et al.
Published: (2026)

Breaking the HBM Bit Cost Barrier: Domain-Specific ECC for AI Inference Infrastructure
by: Xie, Rui, et al.
Published: (2025)

FlatAttention: Dataflow and Fabric Collectives Co-Optimization for Large Attention-Based Model Inference on Tile-Based Accelerators
by: Zhang, Chi, et al.
Published: (2026)

Developing Cost-Effective Drones for 5G Non-Terrestrial Network Research and Experimentation
by: Cáceres, Carlos de Quinto, et al.
Published: (2024)

BitROM: Weight Reload-Free CiROM Architecture Towards Billion-Parameter 1.58-bit LLM Inference
by: Zhang, Wenlun, et al.
Published: (2025)

T-MAN: Enabling End-to-End Low-Bit LLM Inference on NPUs via Unified Table Lookup
by: Wei, Jianyu, et al.
Published: (2025)