| Main Authors: | Pan, Xiurui, Li, Endian, Li, Qiao, Liang, Shengwen, Shan, Yizhou, Zhou, Ke, Luo, Yingwei, Wang, Xiaolin, Zhang, Jie |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2409.04992 |
Similar Items
A Cost-Effective Near-Storage Processing Solution for Offline Inference of Long-Context LLMs
by: Jang, Hongsun, et al.
Published: (2025)
HillInfer: Efficient Long-Context LLM Inference on the Edge with Hierarchical KV Eviction using SmartSSD
by: Sun, He, et al.
Published: (2026)
GPU Acceleration of TFHE-Based High-Precision Nonlinear Layers for Encrypted LLM Inference
by: Chen, Guoci, et al.
Published: (2026)
LLMulator: Generalizable Cost Modeling for Dataflow Accelerators with Input-Adaptive Control Flow
by: Chang, Kaiyan, et al.
Published: (2025)
A Novel Extensible Simulation Framework for CXL-Enabled Systems
by: An, Yuda, et al.
Published: (2024)
FAST-Prefill: FPGA Accelerated Sparse Attention for Long Context LLM Prefill
by: Jayanth, Rakshith, et al.
Published: (2026)
Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference
by: Wu, Haoran, et al.
Published: (2025)
An RDMA-First Object Storage System with SmartNIC Offload
by: Zhu, Yu, et al.
Published: (2025)
UniCAIM: A Unified CAM/CIM Architecture with Static-Dynamic KV Cache Pruning for Efficient Long-Context LLM Inference
by: Xu, Weikai, et al.
Published: (2025)
Cambricon-LLM: A Chiplet-Based Hybrid Architecture for On-Device Inference of 70B LLM
by: Yu, Zhongkai, et al.
Published: (2024)
Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
by: Lin, Bin, et al.
Published: (2024)
Knowledge-Guided Attention-Inspired Learning for Task Offloading in Vehicle Edge Computing
by: Ma, Ke, et al.
Published: (2025)
Understanding Bottlenecks for Efficiently Serving LLM Inference With KV Offloading
by: Meng, William, et al.
Published: (2025)
SSD Offloading for LLM Mixture-of-Experts Weights Considered Harmful in Energy Efficiency
by: Kyung, Kwanhee, et al.
Published: (2025)
Graphitron: A Domain Specific Language for FPGA-based Graph Processing Accelerator Generation
by: Zhang, Xinmiao, et al.
Published: (2024)
PIMphony: Overcoming Bandwidth and Capacity Inefficiency in PIM-based Long-Context LLM Inference System
by: Kwon, Hyucksung, et al.
Published: (2024)
L3: DIMM-PIM Integrated Architecture and Coordination for Scalable Long-Context LLM Inference
by: Liu, Qingyuan, et al.
Published: (2025)
Lifecycle Cost-Effectiveness Modeling for Redundancy-Enhanced Multi-Chiplet Architectures
by: Liu, Zizhen, et al.
Published: (2026)
PD-Swap: Prefill-Decode Logic Swapping for End-to-End LLM Inference on Edge FPGAs via Dynamic Partial Reconfiguration
by: Zhang, Yifan, et al.
Published: (2025)
NVLLM: A 3D NAND-Centric Architecture Enabling Edge on-Device LLM Inference
by: Hao, Mingbo, et al.
Published: (2026)
A Systematic Characterization of LLM Inference on GPUs
by: Wang, Haonan, et al.
Published: (2025)
Convolutions Predictable Offloading to an Accelerator: Formalization and Optimization
by: Husson, Benjamin, et al.
Published: (2026)
TerEffic: Highly Efficient Ternary LLM Inference on FPGA
by: Yin, Chenyang, et al.
Published: (2025)
Bandwidth-Effective DRAM Cache for GPUs with Storage-Class Memory
by: Hong, Jeongmin, et al.
Published: (2024)
AccLLM: Accelerating Long-Context LLM Inference Via Algorithm-Hardware Co-Design
by: Liang, Yanbiao, et al.
Published: (2025)
Mapping Space Exploration for Multi-Chiplet Accelerators Targeting LLM Inference Serving Workloads
by: Li, Boyu, et al.
Published: (2025)
Resilient and Secure Programmable System-on-Chip Accelerator Offload
by: Gouveia, Inês Pinto, et al.
Published: (2024)
OffRAC: Offloading Through Remote Accelerator Calls
by: Yang, Ziyi, et al.
Published: (2025)
Harmonia: Algorithm-Hardware Co-Design for Memory- and Compute-Efficient BFP-based LLM Inference
by: Wang, Xinyu, et al.
Published: (2026)
FusionCIM: Accelerating LLM Inference with Fusion-Driven Computing-in-Memory Architecture
by: Xuan, Zihao, et al.
Published: (2026)
PermuteV: A Performant Side-channel-Resistant RISC-V Core Securing Edge AI Inference
by: Narkthong, Nuntipat, et al.
Published: (2025)
A Switch-Centric In-Network Architecture for Accelerating LLM Inference in Shared-Memory Network
by: Jiang, Aojie, et al.
Published: (2026)
CHIME: Chiplet-based Heterogeneous Near-Memory Acceleration for Edge Multimodal LLM Inference
by: Chen, Yanru, et al.
Published: (2025)
Rethinking LLM Inference Bottlenecks: Insights from Latent Attention and Mixture-of-Experts
by: Yun, Sungmin, et al.
Published: (2025)
Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding
by: Fan, Wang, et al.
Published: (2026)
Breaking the HBM Bit Cost Barrier: Domain-Specific ECC for AI Inference Infrastructure
by: Xie, Rui, et al.
Published: (2025)
FlatAttention: Dataflow and Fabric Collectives Co-Optimization for Large Attention-Based Model Inference on Tile-Based Accelerators
by: Zhang, Chi, et al.
Published: (2026)
Developing Cost-Effective Drones for 5G Non-Terrestrial Network Research and Experimentation
by: Cáceres, Carlos de Quinto, et al.
Published: (2024)
BitROM: Weight Reload-Free CiROM Architecture Towards Billion-Parameter 1.58-bit LLM Inference
by: Zhang, Wenlun, et al.
Published: (2025)
T-MAN: Enabling End-to-End Low-Bit LLM Inference on NPUs via Unified Table Lookup
by: Wei, Jianyu, et al.
Published: (2025)