| Main Authors: | Zhang, Tuo, Li, Ning, Yuan, Xin, Xu, Wenchao, Chen, Quan, Guo, Song, Zhang, Haijun |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2508.07329 |
Similar Items
CoMoE: Collaborative Optimization of Expert Aggregation and Offloading for MoE-based LLMs at Edge
by: Li, Muqing, et al.
Published: (2025)
The MoE-Empowered Edge LLMs Deployment: Architecture, Challenges, and Opportunities
by: Li, Ning, et al.
Published: (2025)
A QoE-Aware Split Inference Accelerating Algorithm for NOMA-based Edge Intelligence
by: Yuan, Xin, et al.
Published: (2024)
Edge Deployment of Small Language Models, a comprehensive comparison of CPU, GPU and NPU backends
by: Prieto, Pablo, et al.
Published: (2025)
One QuantLLM for ALL: Fine-tuning Quantized LLMs Once for Efficient Deployments
by: Yi, Ke, et al.
Published: (2024)
T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge
by: Wei, Jianyu, et al.
Published: (2024)
Beyond the Edge: An Advanced Exploration of Reinforcement Learning for Mobile Edge Computing, its Applications, and Future Research Trajectories
by: Yang, Ning, et al.
Published: (2024)
Privacy-Aware Joint DNN Model Deployment and Partitioning Optimization for Collaborative Edge Inference Services
by: Cheng, Zhipeng, et al.
Published: (2025)
Taming Asynchronous CPU-GPU Coupling for Frequency-aware Latency Estimation on Mobile Edge
by: Chen, Jiesong, et al.
Published: (2026)
Efficient Quantization-Aware Training on Segment Anything Model in Medical Images and Its Deployment
by: Lu, Haisheng, et al.
Published: (2024)
Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU
by: Ning, Zhenyu, et al.
Published: (2024)
HESTIA: A Hessian-Guided Differentiable Quantization-Aware Training Framework for Extremely Low-Bit LLMs
by: Wang, Guoan, et al.
Published: (2026)
An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference
by: Yao, Feiyu, et al.
Published: (2026)
Q-realign: Piggybacking Realignment on Quantization for Safe and Efficient LLM Deployment
by: Tan, Qitao, et al.
Published: (2026)
Fine-Tuning and Deploying Large Language Models Over Edges: Issues and Approaches
by: Dong, Yanjie, et al.
Published: (2024)
Collaborative Compression for Large-Scale MoE Deployment on Edge
by: Chen, Yixiao, et al.
Published: (2025)
EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation
by: Zhang, Shu-Hao, et al.
Published: (2026)
APreQEL: Adaptive Mixed Precision Quantization For Edge LLMs
by: Bouzouad, Meriem, et al.
Published: (2026)
HeRo-Q: A General Framework for Stable Low Bit Quantization via Hessian Conditioning
by: Zhang, Jinhao Zhang Yunquan, et al.
Published: (2026)
Efficient CPU-GPU Collaborative Inference for MoE-based LLMs on Memory-Limited Systems
by: Huang, En-Ming, et al.
Published: (2025)
A Scheduling Framework for Efficient MoE Inference on Edge GPU-NDP Systems
by: Wu, Qi, et al.
Published: (2026)
Deep Learning Models in Speech Recognition: Measuring GPU Energy Consumption, Impact of Noise and Model Quantization for Edge Deployment
by: Chakravarty, Aditya
Published: (2024)
Collaborative Inference and Learning between Edge SLMs and Cloud LLMs: A Survey of Algorithms, Execution, and Open Challenges
by: Li, Senyao, et al.
Published: (2025)
HALO: Semantic-Aware Distributed LLM Inference in Lossy Edge Network
by: Zheng, Peirong, et al.
Published: (2026)
OOM-Free Alpamayo via CPU-GPU Memory Swapping for Vision-Language-Action Models
by: Roh, Seungwoo, et al.
Published: (2026)
Jupiter: Fast and Resource-Efficient Collaborative Inference of Generative LLMs on Edge Devices
by: Ye, Shengyuan, et al.
Published: (2025)
Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs
by: Lin, Haokun, et al.
Published: (2025)
Differentiable Initialization-Accelerated CPU-GPU Hybrid Combinatorial Scheduling
by: Liu, Mingju, et al.
Published: (2026)
EfficientQAT: Efficient Quantization-Aware Training for Large Language Models
by: Chen, Mengzhao, et al.
Published: (2024)
From Poisoned to Aware: Fostering Backdoor Self-Awareness in LLMs
by: Shen, Guangyu, et al.
Published: (2025)
KernelBench: Can LLMs Write Efficient GPU Kernels?
by: Ouyang, Anne, et al.
Published: (2025)
Adaptive Reasoning Executor: A Collaborative Agent System for Efficient Reasoning
by: Ling, Zehui, et al.
Published: (2025)
Multi-Turn Reasoning LLMs for Task Offloading in Mobile Edge Computing
by: Yang, Ning, et al.
Published: (2026)
Q-Palette: Fractional-Bit Quantizers Toward Optimal Bit Allocation for Efficient LLM Deployment
by: Lee, Deokjae, et al.
Published: (2025)
Why Transformers Need Adam: A Hessian Perspective
by: Zhang, Yushun, et al.
Published: (2024)
FAQ: Mitigating Quantization Error via Regenerating Calibration Data with Family-Aware Quantization
by: Xiao, Haiyang, et al.
Published: (2026)
Empirical Guidelines for Deploying LLMs onto Resource-constrained Edge Devices
by: Qin, Ruiyang, et al.
Published: (2024)
On-the-Fly Adaptation to Quantization: Configuration-Aware LoRA for Efficient Fine-Tuning of Quantized LLMs
by: Ye, Rongguang, et al.
Published: (2025)
Privacy-Preserving SAM Quantization for Efficient Edge Intelligence in Healthcare
by: Li, Zhikai, et al.
Published: (2024)
DualRT: A Qos‐Aware Soft Real‐Time Video Analytics Framework for Dual‐Stage GPU‐CPU Tasks on Edge
by: Changhong Zhu, et al.
Published: (2025)