| Main Authors: | Kim, Changdae, Jin, Xianglan |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2502.12592 |
Similar Items
Pinching-Antenna Systems For Indoor Immersive Communications: A 3D-Modeling Based Performance Analysis
by: Wang, Yulei, et al.
Published: (2025)
On General Linearly Implicit Quantized State System Methods
by: Bergonzi, Mariana, et al.
Published: (2025)
H2EAL: Hybrid-Bonding Architecture with Hybrid Sparse Attention for Efficient Long-Context LLM Inference
by: Fu, Zizhuo, et al.
Published: (2025)
Energy-Efficient Software Development: A Multi-dimensional Empirical Analysis of Stack Overflow
by: Jin, Bihui, et al.
Published: (2024)
Profiling Large Language Model Inference on Apple Silicon: A Quantization Perspective
by: Benazir, Afsara, et al.
Published: (2025)
Unveiling the Potential of Quantization with MXFP4: Strategies for Quantization Error Reduction
by: Chhugani, Jatin, et al.
Published: (2026)
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
by: Lin, Yujun, et al.
Published: (2024)
AI Work Quantization Model: Closed-System AI Computational Effort Metric
by: Sharma, Aasish Kumar, et al.
Published: (2025)
HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing
by: Lin, Mao, et al.
Published: (2026)
A Novel Hybrid Optical and STAR IRS System for NTN Communications
by: Shang, Shunyuan, et al.
Published: (2025)
Beamforming-based Achievable Rate Maximization in ISAC System for Multi-UAV Networking
by: Zhou, Shengcai, et al.
Published: (2025)
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
by: Liu, Zirui, et al.
Published: (2024)
Accelerating Sparse Ternary GEMM for Quantized ML on Apple Silicon
by: Lipshitz, Baraq, et al.
Published: (2025)
SONIQ: System-Optimized Noise-Injected Ultra-Low-Precision Quantization with Full-Precision Parity
by: Zhou, Cyrus, et al.
Published: (2023)
FlexQuant: Elastic Quantization Framework for Locally Hosted LLM on Edge Devices
by: Chai, Yuji, et al.
Published: (2025)
Large-Scale Data Parallelization of Product Quantization and Inverted Indexing Using Dask
by: Abraham, Ashley N., et al.
Published: (2026)
Multi-GPU Hybrid Particle-in-Cell Monte Carlo Simulations for Exascale Computing Systems
by: Williams, Jeremy J., et al.
Published: (2026)
Efficient Reinforcement Learning for Routing Jobs in Heterogeneous Queueing Systems
by: Jali, Neharika, et al.
Published: (2024)
GPU-Accelerated INT8 Quantization for KV Cache Compression in Large Language Models
by: Taneja, Maanas, et al.
Published: (2026)
EXAQ: Exponent Aware Quantization For LLMs Acceleration
by: Shkolnik, Moran, et al.
Published: (2024)
When Quantization Is Free: An int4 KV Cache That Outruns fp16 on Apple Silicon
by: Bergach, Mohamed Amine
Published: (2026)
An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference
by: Yao, Feiyu, et al.
Published: (2026)
Fault-Tolerant Hybrid-Parallel Training at Scale with Reliable and Efficient In-memory Checkpointing
by: Wang, Yuxin, et al.
Published: (2023)
Dynamic Expert Quantization for Scalable Mixture-of-Experts Inference
by: Chu, Kexin, et al.
Published: (2025)
GreenServ: Energy-Efficient Context-Aware Dynamic Routing for Multi-Model LLM Inference
by: Ziller, Thomas, et al.
Published: (2026)
oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation
by: Li, Jianhui, et al.
Published: (2023)
Scaler: Efficient and Effective Cross Flow Analysis
by: Steven, et al.
Published: (2024)
QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead
by: Zandieh, Amir, et al.
Published: (2024)
Time-Efficient Hybrid Hyperparameter Tuning Approach for Cardiovascular Disease Classification
by: Pathak, Abhay Kumar, et al.
Published: (2024)
HD-MoE: Hybrid and Dynamic Parallelism for Mixture-of-Expert LLMs with 3D Near-Memory Processing
by: Huang, Haochen, et al.
Published: (2025)
AR-PPF: Advanced Resolution-Based Pixel Preemption Data Filtering for Efficient Time-Series Data Analysis
by: Kim, Taewoong, et al.
Published: (2024)
A2Q+: Improving Accumulator-Aware Weight Quantization
by: Colbert, Ian, et al.
Published: (2024)
Efficient Data-Driven Production Scheduling in Pharmaceutical Manufacturing
by: Balatsos, Ioannis, et al.
Published: (2026)
Mosaic: Cross-Modal Clustering for Efficient Video Understanding
by: Wang, Tuowei, et al.
Published: (2026)
Reducing Waiting Time for Medical Tourists Through Hybrid Agent-Based and Discrete-Event Simulation: A Hospital Case Study
by: Baghi, Melika, et al.
Published: (2026)
Towards Efficient Multi-Scale Deformable Attention on NPU
by: Huang, Chenghuan, et al.
Published: (2025)
Accurate Performance Modeling And Uncertainty Analysis of Lossy Compression in Scientific Applications
by: Liu, Youyuan, et al.
Published: (2024)
Resource-Efficient RGB-Only Action Recognition for Edge Deployment
by: Yoon, Dongsik, et al.
Published: (2026)
PerfSeer: An Efficient and Accurate Deep Learning Models Performance Predictor
by: Zhao, Xinlong, et al.
Published: (2025)
ONNXim: A Fast, Cycle-level Multi-core NPU Simulator
by: Ham, Hyungkyu, et al.
Published: (2024)