:: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Tuo; Li, Ning; Yuan, Xin; Xu, Wenchao; Chen, Quan; Guo, Song; Zhang, Haijun
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2508.07329

Similar Items

CoMoE: Collaborative Optimization of Expert Aggregation and Offloading for MoE-based LLMs at Edge
by: Li, Muqing, et al.
Published: (2025)

The MoE-Empowered Edge LLMs Deployment: Architecture, Challenges, and Opportunities
by: Li, Ning, et al.
Published: (2025)

A QoE-Aware Split Inference Accelerating Algorithm for NOMA-based Edge Intelligence
by: Yuan, Xin, et al.
Published: (2024)

Edge Deployment of Small Language Models, a comprehensive comparison of CPU, GPU and NPU backends
by: Prieto, Pablo, et al.
Published: (2025)

One QuantLLM for ALL: Fine-tuning Quantized LLMs Once for Efficient Deployments
by: Yi, Ke, et al.
Published: (2024)

T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge
by: Wei, Jianyu, et al.
Published: (2024)

Beyond the Edge: An Advanced Exploration of Reinforcement Learning for Mobile Edge Computing, its Applications, and Future Research Trajectories
by: Yang, Ning, et al.
Published: (2024)

Privacy-Aware Joint DNN Model Deployment and Partitioning Optimization for Collaborative Edge Inference Services
by: Cheng, Zhipeng, et al.
Published: (2025)

Taming Asynchronous CPU-GPU Coupling for Frequency-aware Latency Estimation on Mobile Edge
by: Chen, Jiesong, et al.
Published: (2026)

Efficient Quantization-Aware Training on Segment Anything Model in Medical Images and Its Deployment
by: Lu, Haisheng, et al.
Published: (2024)

Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU
by: Ning, Zhenyu, et al.
Published: (2024)

HESTIA: A Hessian-Guided Differentiable Quantization-Aware Training Framework for Extremely Low-Bit LLMs
by: Wang, Guoan, et al.
Published: (2026)

An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference
by: Yao, Feiyu, et al.
Published: (2026)

Q-realign: Piggybacking Realignment on Quantization for Safe and Efficient LLM Deployment
by: Tan, Qitao, et al.
Published: (2026)

Fine-Tuning and Deploying Large Language Models Over Edges: Issues and Approaches
by: Dong, Yanjie, et al.
Published: (2024)

Collaborative Compression for Large-Scale MoE Deployment on Edge
by: Chen, Yixiao, et al.
Published: (2025)

EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation
by: Zhang, Shu-Hao, et al.
Published: (2026)

APreQEL: Adaptive Mixed Precision Quantization For Edge LLMs
by: Bouzouad, Meriem, et al.
Published: (2026)

HeRo-Q: A General Framework for Stable Low Bit Quantization via Hessian Conditioning
by: Zhang, Jinhao, et al.
Published: (2026)

Efficient CPU-GPU Collaborative Inference for MoE-based LLMs on Memory-Limited Systems
by: Huang, En-Ming, et al.
Published: (2025)

A Scheduling Framework for Efficient MoE Inference on Edge GPU-NDP Systems
by: Wu, Qi, et al.
Published: (2026)

Deep Learning Models in Speech Recognition: Measuring GPU Energy Consumption, Impact of Noise and Model Quantization for Edge Deployment
by: Chakravarty, Aditya
Published: (2024)

Collaborative Inference and Learning between Edge SLMs and Cloud LLMs: A Survey of Algorithms, Execution, and Open Challenges
by: Li, Senyao, et al.
Published: (2025)

HALO: Semantic-Aware Distributed LLM Inference in Lossy Edge Network
by: Zheng, Peirong, et al.
Published: (2026)

OOM-Free Alpamayo via CPU-GPU Memory Swapping for Vision-Language-Action Models
by: Roh, Seungwoo, et al.
Published: (2026)

Jupiter: Fast and Resource-Efficient Collaborative Inference of Generative LLMs on Edge Devices
by: Ye, Shengyuan, et al.
Published: (2025)

Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs
by: Lin, Haokun, et al.
Published: (2025)

Differentiable Initialization-Accelerated CPU-GPU Hybrid Combinatorial Scheduling
by: Liu, Mingju, et al.
Published: (2026)

EfficientQAT: Efficient Quantization-Aware Training for Large Language Models
by: Chen, Mengzhao, et al.
Published: (2024)

From Poisoned to Aware: Fostering Backdoor Self-Awareness in LLMs
by: Shen, Guangyu, et al.
Published: (2025)

KernelBench: Can LLMs Write Efficient GPU Kernels?
by: Ouyang, Anne, et al.
Published: (2025)

Adaptive Reasoning Executor: A Collaborative Agent System for Efficient Reasoning
by: Ling, Zehui, et al.
Published: (2025)

Multi-Turn Reasoning LLMs for Task Offloading in Mobile Edge Computing
by: Yang, Ning, et al.
Published: (2026)

Q-Palette: Fractional-Bit Quantizers Toward Optimal Bit Allocation for Efficient LLM Deployment
by: Lee, Deokjae, et al.
Published: (2025)

Why Transformers Need Adam: A Hessian Perspective
by: Zhang, Yushun, et al.
Published: (2024)

FAQ: Mitigating Quantization Error via Regenerating Calibration Data with Family-Aware Quantization
by: Xiao, Haiyang, et al.
Published: (2026)

Empirical Guidelines for Deploying LLMs onto Resource-constrained Edge Devices
by: Qin, Ruiyang, et al.
Published: (2024)

On-the-Fly Adaptation to Quantization: Configuration-Aware LoRA for Efficient Fine-Tuning of Quantized LLMs
by: Ye, Rongguang, et al.
Published: (2025)

Privacy-Preserving SAM Quantization for Efficient Edge Intelligence in Healthcare
by: Li, Zhikai, et al.
Published: (2024)

DualRT: A QoS‐Aware Soft Real‐Time Video Analytics Framework for Dual‐Stage GPU‐CPU Tasks on Edge
by: Zhu, Changhong, et al.
Published: (2025)