| Main Authors: | Zhuge, Xiangwen, Shen, Xu, Wang, Zeyu, Dang, Fan, Ding, Xuan, Li, Danyang, Han, Yahui, Hao, Tianxiang, Yang, Zheng |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2505.10259 |
Similar Items
HCInfer: An Efficient Inference System via Error Compensation for Resource-Constrained Devices
by: Xu, Shen, et al.
Published: (2026)
Camel: Energy-Aware LLM Inference on Resource-Constrained Devices
by: Xu, Hao, et al.
Published: (2025)
NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference
by: Jiang, Xuanlin, et al.
Published: (2024)
SpecInF: Exploiting Idle GPU Resources in Distributed DL Training via Speculative Inference Filling
by: Lv, Cunchi, et al.
Published: (2025)
DAK: Direct-Access-Enabled GPU Memory Offloading with Optimal Efficiency for LLM Inference
by: Lin, Shouxu, et al.
Published: (2026)
GELATO: Generative Entropy- and Lyapunov-based Adaptive Token Offloading for Device-Edge Speculative LLM Inference
by: Tang, Zengzipeng, et al.
Published: (2026)
APEX: Asynchronous Parallel CPU-GPU Execution for Online LLM Inference on Constrained GPUs
by: Fan, Jiakun, et al.
Published: (2025)
SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices
by: Svirschevski, Ruslan, et al.
Published: (2024)
PLS-Assisted Offloading for Edge Computing-Enabled Post-Quantum Security in Resource-Constrained Devices
by: Amiriara, Hamid, et al.
Published: (2025)
SOLARIS: Speculative Offloading of Latent-bAsed Representation for Inference Scaling
by: Liu, Zikun, et al.
Published: (2026)
PIPO: Pipelined Offloading for Efficient Inference on Consumer Devices
by: Liu, Yangyijian, et al.
Published: (2025)
Challenging GPU Dominance: When CPUs Outperform for On-Device LLM Inference
by: Zhang, Haolin, et al.
Published: (2025)
LIME: Accelerating Collaborative Lossless LLM Inference on Memory-Constrained Edge Devices
by: Sun, Mingyu, et al.
Published: (2025)
Prada: Black-Box LLM Adaptation with Private Data on Resource-Constrained Devices
by: Wang, Ziyao, et al.
Published: (2025)
GPU-Augmented OLAP Execution Engine: GPU Offloading
by: Chang, Ilsun
Published: (2025)
FlexInfer: Breaking Memory Constraint via Flexible and Efficient Offloading for On-Device LLM Inference
by: Du, Hongchao, et al.
Published: (2025)
VeriSplit: Secure and Practical Offloading of Machine Learning Inferences across IoT Devices
by: Zhang, Han, et al.
Published: (2024)
FDC: Fast KV Dimensionality Compression for Efficient LLM Inference
by: Zhang, Zeyu, et al.
Published: (2024)
PecSched: Preemptive and Efficient Cluster Scheduling for LLM Inference
by: Zhang, Zeyu, et al.
Published: (2024)
MLP-Offload: Multi-Level, Multi-Path Offloading for LLM Pre-training to Break the GPU Memory Wall
by: Maurya, Avinash, et al.
Published: (2025)
Latent Sensor Fusion: Multimedia Learning of Physiological Signals for Resource-Constrained Devices
by: Ahmed, Abdullah, et al.
Published: (2025)
Perspectives on Devices for Integrated Phononic Circuits
by: Yihang Yao, et al.
Published: (2025)
TURNIP: A "Nondeterministic" GPU Runtime with CPU RAM Offload
by: Ding, Zhimin, et al.
Published: (2024)
Characterize LSM-tree Compaction Performance via On-Device LLM Inference
by: Ding, Jiabiao, et al.
Published: (2026)
Improved Decision Module Selection for Hierarchical Inference in Resource-Constrained Edge Devices
by: Behera, Adarsh Prasad, et al.
Published: (2024)
TriMoE: Augmenting GPU with AMX-Enabled CPU and DIMM-NDP for High-Throughput MoE Inference via Offloading
by: Pan, Yudong, et al.
Published: (2026)
Stop Overthinking: Unlocking Efficient Listwise Reranking with Minimal Reasoning
by: Liu, Danyang, et al.
Published: (2026)
A Multi-LLM-Agent-Based Framework for Economic and Public Policy Analysis
by: Hao, Yuzhi, et al.
Published: (2025)
Scaling On-Device GPU Inference for Large Generative Models
by: Tang, Jiuqiang, et al.
Published: (2025)
Temporal-Aware GPU Resource Allocation for Distributed LLM Inference via Reinforcement Learning
by: Du, Chengze, et al.
Published: (2025)
TriSpec: Ternary Speculative Decoding via Lightweight Proxy Verification
by: Jiang, Haoyun, et al.
Published: (2026)
Endor: Hardware-Friendly Sparse Format for Offloaded LLM Inference
by: Joo, Donghyeon, et al.
Published: (2024)
Understanding Bottlenecks for Efficiently Serving LLM Inference With KV Offloading
by: Meng, William, et al.
Published: (2025)
HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices
by: Zhao, Xuanlei, et al.
Published: (2024)
A Task Decomposition and Planning Framework for Efficient LLM Inference in AI-Enabled WiFi-Offload Networks
by: Han, Mingqi, et al.
Published: (2026)
DiffuSpec: Unlocking Diffusion Language Models for Speculative Decoding
by: Li, Guanghao, et al.
Published: (2025)
SpecFed: Accelerating Federated LLM Inference with Speculative Decoding and Compressed Transmission
by: Zheng, Ce, et al.
Published: (2026)
SpecPipe: Accelerating Pipeline Parallelism-based LLM Inference with Speculative Decoding
by: Yin, Haofei, et al.
Published: (2025)
FlowSpec: Continuous Pipelined Speculative Decoding for Efficient Distributed LLM Inference
by: Liu, Xing, et al.
Published: (2025)
Dependency Tasks Offloading and Communication Resource Allocation in Collaborative UAVs Networks: A Meta-Heuristic Approach
by: Nguyen, Loc X., et al.
Published: (2022)