| Main Authors: | Zhang, Jinghe, Xu, Daliang, Wang, Chenghua, Xie, Weikai, Qi, Tao, Ma, Yun, Xu, Mengwei, Huang, Gang |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.20295 |
Similar Items
ShadowNPU: System and Algorithm Co-design for NPU-Centric On-Device LLM Inference
by: Yin, Wangsong, et al.
Published: (2025)
Accelerating Mobile Language Model via Speculative Decoding and NPU-Coordinated Execution
by: Chen, Zhiyang, et al.
Published: (2025)
Fast On-device LLM Inference with NPUs
by: Xu, Daliang, et al.
Published: (2024)
NanoSpec: Accelerating Speculative Decoding using Minimalist In-Context Vocabularies
by: Chen, Zhiyang, et al.
Published: (2026)
MobileQuant: Mobile-friendly Quantization for On-device Language Models
by: Tan, Fuwen, et al.
Published: (2024)
MobiEdit: Resource-efficient Knowledge Editing for Personalized On-device LLMs
by: Lu, Zhenyan, et al.
Published: (2025)
Elastic On-Device LLM Service
by: Yin, Wangsong, et al.
Published: (2024)
PhoneLM: an Efficient and Capable Small Language Model Family through Principled Pre-training
by: Yi, Rongjie, et al.
Published: (2024)
PrivQuant: Communication-Efficient Private Inference with Quantized Network/Protocol Co-Optimization
by: Xu, Tianshi, et al.
Published: (2024)
Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge
by: Shen, Xuan, et al.
Published: (2023)
One QuantLLM for ALL: Fine-tuning Quantized LLMs Once for Efficient Deployments
by: Yi, Ke, et al.
Published: (2024)
DroidCall: A Dataset for LLM-powered Android Intent Invocation
by: Xie, Weikai, et al.
Published: (2024)
EasyQuant: An Efficient Data-free Quantization Algorithm for LLMs
by: Tang, Hanlin, et al.
Published: (2024)
ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference
by: Liang, Yesheng, et al.
Published: (2025)
Towards Efficient Multi-Scale Deformable Attention on NPU
by: Huang, Chenghuan, et al.
Published: (2025)
Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs
by: Lu, Haiquan, et al.
Published: (2026)
QuantV2X: A Fully Quantized Multi-Agent System for Cooperative Perception
by: Zhao, Seth Z., et al.
Published: (2025)
UniCAIM: A Unified CAM/CIM Architecture with Static-Dynamic KV Cache Pruning for Efficient Long-Context LLM Inference
by: Xu, Weikai, et al.
Published: (2025)
MergeQuant: Accurate 4-bit Static Quantization of Large Language Models by Channel-wise Calibration
by: Wang, Jinguang, et al.
Published: (2025)
EfficientQuant: An Efficient Post-Training Quantization for CNN-Transformer Hybrid Models on Edge Devices
by: Saha, Shaibal, et al.
Published: (2025)
CalibQuant: 1-Bit KV Cache Quantization for Multimodal LLMs
by: Han, Insu, et al.
Published: (2025)
SliderQuant: Accurate Post-Training Quantization for LLMs
by: Wang, Shigeng, et al.
Published: (2026)
QuantFace: Efficient Quantization for Face Restoration
by: Li, Jiatong, et al.
Published: (2025)
LLMEasyQuant: Scalable Quantization for Parallel and Distributed LLM Inference
by: Liu, Dong, et al.
Published: (2024)
DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs
by: Lin, Haokun, et al.
Published: (2024)
NestQuant: Nested Lattice Quantization for Matrix Products and LLMs
by: Savkin, Semyon, et al.
Published: (2025)
D$^2$Quant: Accurate Low-bit Post-Training Weight Quantization for LLMs
by: Yan, Xianglong, et al.
Published: (2026)
GS-Quant: Granular Semantic and Generative Structural Quantization for Knowledge Graph Completion
by: Xie, Qizhuo, et al.
Published: (2026)
NPU Design for Diffusion Language Model Inference
by: Lou, Binglei, et al.
Published: (2026)
DartQuant: Efficient Rotational Distribution Calibration for LLM Quantization
by: Shao, Yuantian, et al.
Published: (2025)
MambaQuant: Quantizing the Mamba Family with Variance Aligned Rotation Methods
by: Xu, Zukang, et al.
Published: (2025)
SigmaQuant: Hardware-Aware Heterogeneous Quantization Method for Edge DNN Inference
by: Liu, Qunyou, et al.
Published: (2026)
Does Chain-of-Thought Reasoning Help Mobile GUI Agent? An Empirical Study
by: Zhang, Li, et al.
Published: (2025)
LLM Inference at the Edge: Mobile, NPU, and GPU Performance Efficiency Trade-offs Under Sustained Load
by: Tummalapalli, Pranay, et al.
Published: (2026)
WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization
by: Tao, Wei, et al.
Published: (2026)
I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models
by: Hu, Xing, et al.
Published: (2024)
OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models
by: Shao, Wenqi, et al.
Published: (2023)
Every Software as an Agent: Blueprint and Case Study
by: Xu, Mengwei
Published: (2025)
DilateQuant: Accurate and Efficient Diffusion Quantization via Weight Dilation
by: Liu, Xuewen, et al.
Published: (2024)
Scaling LLM Test-Time Compute with Mobile NPU on Smartphones
by: Hao, Zixu, et al.
Published: (2025)