:: Library Catalog

Bibliographic Details
Main authors:	Skliar, Andrii, van Rozendaal, Ties, Lepert, Romain, Boinovski, Todor, van Baalen, Mart, Nagel, Markus, Whatmough, Paul, Bejnordi, Babak Ehteshami
Format:	Preprint
Published:	2024
Subjects:	Machine Learning, Artificial Intelligence, Hardware Architecture
Online access:	https://arxiv.org/abs/2412.00099

Similar Items

Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking
by: Federici, Marco, et al.
Published: (2024)

KaVa: Latent Reasoning via Compressed KV-Cache Distillation
by: Kuzina, Anna, et al.
Published: (2025)

Leech Lattice Vector Quantization for Efficient LLM Compression
by: van der Ouderaa, Tycho F. A., et al.
Published: (2026)

Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding
by: Bergner, Benjamin, et al.
Published: (2024)

Area-Efficient In-Memory Computing for Mixture-of-Experts via Multiplexing and Caching
by: Gao, Hanyuan, et al.
Published: (2026)

Dirichlet-Prior Shaping: Guiding Expert Specialization in Upcycled MoEs
by: Mirvakhabova, Leyla, et al.
Published: (2025)

NASiC: 3D NAND-based CAM-Selected Multibit CIM Architecture for Efficient On-Device Mixture-of-Experts LLM Inference
by: Xu, Weikai, et al.
Published: (2026)

PiKV: KV Cache Management System for Mixture of Experts
by: Liu, Dong, et al.
Published: (2025)

SliceMoE: Bit-Sliced Expert Caching under Miss-Rate Constraints for Efficient MoE Inference
by: Choi, Yuseon, et al.
Published: (2025)

Scaling Multi-Node Mixture-of-Experts Inference Using Expert Activation Patterns
by: Bambhaniya, Abhimanyu, et al.
Published: (2026)

Rethinking LLM Inference Bottlenecks: Insights from Latent Attention and Mixture-of-Experts
by: Yun, Sungmin, et al.
Published: (2025)

Context-Aware Mixture-of-Experts Inference on CXL-Enabled GPU-NDP Systems
by: Fan, Zehao, et al.
Published: (2025)

Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models
by: Kim, Jungwoo, et al.
Published: (2026)

Multi-Dimensional Vector ISA Extension for Mobile In-Cache Computing
by: Khadem, Alireza, et al.
Published: (2025)

Duplex: A Device for Large Language Models with Mixture of Experts, Grouped Query Attention, and Continuous Batching
by: Yun, Sungmin, et al.
Published: (2024)

VeriCache: Turning Lossy KV Cache into Lossless LLM Inference
by: Yao, Jiayi, et al.
Published: (2026)

Characterizing Soft-Error Resiliency in Arm's Ethos-U55 Embedded Machine Learning Accelerator
by: Tyagi, Abhishek, et al.
Published: (2024)

Nemo: A Low-Write-Amplification Cache for Tiny Objects on Log-Structured Flash Devices
by: Yang, Xufeng, et al.
Published: (2026)

SSD Offloading for LLM Mixture-of-Experts Weights Considered Harmful in Energy Efficiency
by: Kyung, Kwanhee, et al.
Published: (2025)

Pruning vs Quantization: Which is Better?
by: Kuzmin, Andrey, et al.
Published: (2023)

Comparative Characterization of KV Cache Management Strategies for LLM Inference
by: Mamo, Oteo, et al.
Published: (2026)

Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference
by: Hwang, Ranggi, et al.
Published: (2023)

GPTVQ: The Blessing of Dimensionality for LLM Quantization
by: van Baalen, Mart, et al.
Published: (2024)

CoQMoE: Co-Designed Quantization and Computation Orchestration for Mixture-of-Experts Vision Transformer on FPGA
by: Dong, Jiale, et al.
Published: (2025)

DCI: A Coordinated Allocation and Filling Workload-Aware Dual-Cache Allocation GNN Inference Acceleration System
by: Luo, Yi, et al.
Published: (2025)

I/O Transit Caching for PMem-based Block Device
by: Xu, Qing, et al.
Published: (2024)

Thales: Formulating and Estimating Architectural Vulnerability Factors for DNN Accelerators
by: Tyagi, Abhishek, et al.
Published: (2022)

UbiMoE: A Ubiquitous Mixture-of-Experts Vision Transformer Accelerator With Hybrid Computation Pattern on FPGA
by: Dong, Jiale, et al.
Published: (2025)

A3D-MoE: Acceleration of Large Language Models with Mixture of Experts via 3D Heterogeneous Integration
by: Huang, Wei-Hsing, et al.
Published: (2025)

UniCAIM: A Unified CAM/CIM Architecture with Static-Dynamic KV Cache Pruning for Efficient Long-Context LLM Inference
by: Xu, Weikai, et al.
Published: (2025)

AxMoE: Characterizing the Impact of Approximate Multipliers on Mixture-of-Experts DNN Architectures
by: Shende, Omkar B, et al.
Published: (2026)

ICGMM: CXL-enabled Memory Expansion with Intelligent Caching Using Gaussian Mixture Model
by: Chen, Hanqiu, et al.
Published: (2024)

Accelerating Boolean Constraint Propagation for Efficient SAT-Solving on FPGAs
by: Govindasamy, Hariprasadh, et al.
Published: (2024)

BackCache: Mitigating Contention-Based Cache Timing Attacks by Hiding Cache Line Evictions
by: Wang, Quancheng, et al.
Published: (2023)

Hierarchical Mixture of Experts: Generalizable Learning for High-Level Synthesis
by: Li, Weikai, et al.
Published: (2024)

Cambricon-LLM: A Chiplet-Based Hybrid Architecture for On-Device Inference of 70B LLM
by: Yu, Zhongkai, et al.
Published: (2024)

VitaLLM: A Versatile and Tiny Accelerator for Mixed-Precision LLM Inference on Edge Devices
by: Lin, Zi-Wei, et al.
Published: (2026)

NVLLM: A 3D NAND-Centric Architecture Enabling Edge on-Device LLM Inference
by: Hao, Mingbo, et al.
Published: (2026)

Accelerating LLM Inference via Dynamic KV Cache Placement in Heterogeneous Memory System
by: Fang, Yunhua, et al.
Published: (2025)

The Avatar Cache: Enabling On-Demand Security with Morphable Cache Architecture
by: Bhatla, Anubhav, et al.
Published: (2026)