Library Catalog

Bibliographic Details
Main Authors:	Kim, Changdae; Jin, Xianglan
Format:	Preprint
Published:	2025
Subjects:	Performance
Online Access:	https://arxiv.org/abs/2502.12592

Similar Items

Pinching-Antenna Systems For Indoor Immersive Communications: A 3D-Modeling Based Performance Analysis
by: Wang, Yulei, et al.
Published: (2025)

On General Linearly Implicit Quantized State System Methods
by: Bergonzi, Mariana, et al.
Published: (2025)

H2EAL: Hybrid-Bonding Architecture with Hybrid Sparse Attention for Efficient Long-Context LLM Inference
by: Fu, Zizhuo, et al.
Published: (2025)

Energy-Efficient Software Development: A Multi-dimensional Empirical Analysis of Stack Overflow
by: Jin, Bihui, et al.
Published: (2024)

Profiling Large Language Model Inference on Apple Silicon: A Quantization Perspective
by: Benazir, Afsara, et al.
Published: (2025)

Unveiling the Potential of Quantization with MXFP4: Strategies for Quantization Error Reduction
by: Chhugani, Jatin, et al.
Published: (2026)

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
by: Lin, Yujun, et al.
Published: (2024)

AI Work Quantization Model: Closed-System AI Computational Effort Metric
by: Sharma, Aasish Kumar, et al.
Published: (2025)

HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing
by: Lin, Mao, et al.
Published: (2026)

A Novel Hybrid Optical and STAR IRS System for NTN Communications
by: Shang, Shunyuan, et al.
Published: (2025)

Beamforming-based Achievable Rate Maximization in ISAC System for Multi-UAV Networking
by: Zhou, Shengcai, et al.
Published: (2025)

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
by: Liu, Zirui, et al.
Published: (2024)

Accelerating Sparse Ternary GEMM for Quantized ML on Apple Silicon
by: Lipshitz, Baraq, et al.
Published: (2025)

SONIQ: System-Optimized Noise-Injected Ultra-Low-Precision Quantization with Full-Precision Parity
by: Zhou, Cyrus, et al.
Published: (2023)

FlexQuant: Elastic Quantization Framework for Locally Hosted LLM on Edge Devices
by: Chai, Yuji, et al.
Published: (2025)

Large-Scale Data Parallelization of Product Quantization and Inverted Indexing Using Dask
by: Abraham, Ashley N., et al.
Published: (2026)

Multi-GPU Hybrid Particle-in-Cell Monte Carlo Simulations for Exascale Computing Systems
by: Williams, Jeremy J., et al.
Published: (2026)

Efficient Reinforcement Learning for Routing Jobs in Heterogeneous Queueing Systems
by: Jali, Neharika, et al.
Published: (2024)

GPU-Accelerated INT8 Quantization for KV Cache Compression in Large Language Models
by: Taneja, Maanas, et al.
Published: (2026)

EXAQ: Exponent Aware Quantization For LLMs Acceleration
by: Shkolnik, Moran, et al.
Published: (2024)

When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon
by: Bergach, Mohamed Amine
Published: (2026)

An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference
by: Yao, Feiyu, et al.
Published: (2026)

Fault-Tolerant Hybrid-Parallel Training at Scale with Reliable and Efficient In-memory Checkpointing
by: Wang, Yuxin, et al.
Published: (2023)

Dynamic Expert Quantization for Scalable Mixture-of-Experts Inference
by: Chu, Kexin, et al.
Published: (2025)

GreenServ: Energy-Efficient Context-Aware Dynamic Routing for Multi-Model LLM Inference
by: Ziller, Thomas, et al.
Published: (2026)

oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation
by: Li, Jianhui, et al.
Published: (2023)

Scaler: Efficient and Effective Cross Flow Analysis
by: Steven, et al.
Published: (2024)

QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead
by: Zandieh, Amir, et al.
Published: (2024)

Time-Efficient Hybrid Hyperparameter Tuning Approach for Cardiovascular Disease Classification
by: Pathak, Abhay Kumar, et al.
Published: (2024)

HD-MoE: Hybrid and Dynamic Parallelism for Mixture-of-Expert LLMs with 3D Near-Memory Processing
by: Huang, Haochen, et al.
Published: (2025)

AR-PPF: Advanced Resolution-Based Pixel Preemption Data Filtering for Efficient Time-Series Data Analysis
by: Kim, Taewoong, et al.
Published: (2024)

A2Q+: Improving Accumulator-Aware Weight Quantization
by: Colbert, Ian, et al.
Published: (2024)

Efficient Data-Driven Production Scheduling in Pharmaceutical Manufacturing
by: Balatsos, Ioannis, et al.
Published: (2026)

Mosaic: Cross-Modal Clustering for Efficient Video Understanding
by: Wang, Tuowei, et al.
Published: (2026)

Reducing Waiting Time for Medical Tourists Through Hybrid Agent-Based and Discrete-Event Simulation: A Hospital Case Study
by: Baghi, Melika, et al.
Published: (2026)

Towards Efficient Multi-Scale Deformable Attention on NPU
by: Huang, Chenghuan, et al.
Published: (2025)

Accurate Performance Modeling And Uncertainty Analysis of Lossy Compression in Scientific Applications
by: Liu, Youyuan, et al.
Published: (2024)

Resource-Efficient RGB-Only Action Recognition for Edge Deployment
by: Yoon, Dongsik, et al.
Published: (2026)

PerfSeer: An Efficient and Accurate Deep Learning Models Performance Predictor
by: Zhao, Xinlong, et al.
Published: (2025)

ONNXim: A Fast, Cycle-level Multi-core NPU Simulator
by: Ham, Hyungkyu, et al.
Published: (2024)