| Main authors: | Wang, Zhaode; Yang, Jingbang; Qian, Xinyu; Xing, Shiwen; Jiang, Xiaotang; Lv, Chengfei; Zhang, Shengyu |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online access: | https://arxiv.org/abs/2506.10443 |
Similar items
MNN-AECS: Energy Optimization for LLM Decoding on Mobile Devices via Adaptive Core Selection
by: Huang, Zhengxiang, et al.
Published: (2025)
MadaKV: Adaptive Modality-Perception KV Cache Eviction for Efficient Multimodal Long-Context Inference
by: Li, Kunxi, et al.
Published: (2025)
FlowMM: Cross-Modal Information Flow Guided KV Cache Merging for Efficient Multimodal Context Inference
by: Li, Kunxi, et al.
Published: (2025)
MobileKernelBench: Can LLMs Write Efficient Kernels for Mobile Devices?
by: Zou, Xingze, et al.
Published: (2026)
PureKV: Plug-and-Play KV Cache Optimization with Spatial-Temporal Sparse Attention for Vision-Language Large Models
by: Jiang, Zhonghua, et al.
Published: (2025)
AccKV: Towards Efficient Audio-Video LLMs Inference via Adaptive-Focusing and Cross-Calibration KV Cache Optimization
by: Jiang, Zhonghua, et al.
Published: (2025)
RecGPT-Mobile: On-Device Large Language Models for User Intent Understanding in Taobao Feed Recommendation
by: Zhang, Bin, et al.
Published: (2026)
Fast Distributed Inference Serving for Large Language Models
by: Wu, Bingyang, et al.
Published: (2023)
RetentiveKV: State-Space Memory for Uncertainty-Aware Multimodal KV Cache Eviction
by: Liu, Sihao, et al.
Published: (2026)
MobileLLM-Flash: Latency-Guided On-Device LLM Design for Industry Scale Deployment
by: Huang, Hanxian, et al.
Published: (2026)
Efficient Deployment of Large Language Models on Resource-constrained Devices
by: Yao, Zhiwei, et al.
Published: (2025)
ModelGPT: Unleashing LLM's Capabilities for Tailored Model Generation
by: Tang, Zihao, et al.
Published: (2024)
Semantic Trimming and Auxiliary Multi-step Prediction for Generative Recommendation
by: Zhan, Tianyu, et al.
Published: (2026)
Large Language Models Inference Engines based on Spiking Neural Networks
by: Balaji, Adarsha, et al.
Published: (2025)
Large Language Models as Urban Residents: An LLM Agent Framework for Personal Mobility Generation
by: Wang, Jiawei, et al.
Published: (2024)
Collaborative Learning of On-Device Small Model and Cloud-Based Large Model: Advances and Future Directions
by: Niu, Chaoyue, et al.
Published: (2025)
ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models
by: Zeng, Chao, et al.
Published: (2024)
AmoebaLLM: Constructing Any-Shape Large Language Models for Efficient and Instant Deployment
by: Fu, Yonggan, et al.
Published: (2024)
Efficient Deployment of Vision-Language Models on Mobile Devices: A Case Study on OnePlus 13R
by: Guerrero, Pablo Robin, et al.
Published: (2025)
Fast Inference for Augmented Large Language Models
by: Shahout, Rana, et al.
Published: (2024)
BEAR: Towards Beam-Search-Aware Optimization for Recommendation with Large Language Models
by: Yang, Weiqin, et al.
Published: (2026)
Fast and Compact Tsetlin Machine Inference on CPUs Using Instruction-Level Optimization
by: Zeng, Yefan, et al.
Published: (2025)
Fast NF4 Dequantization Kernels for Large Language Model Inference
by: Qi, Xiangbo, et al.
Published: (2026)
RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference
by: Chen, Yaoqi, et al.
Published: (2025)
EntroLLM: Entropy Encoded Weight Compression for Efficient Large Language Model Inference on Edge Devices
by: Sanyal, Arnab, et al.
Published: (2025)
Scaling On-Device GPU Inference for Large Generative Models
by: Tang, Jiuqiang, et al.
Published: (2025)
Resource-Efficient Generative AI Model Deployment in Mobile Edge Networks
by: Liang, Yuxin, et al.
Published: (2024)
Graph Neural Networks Automated Design and Deployment on Device-Edge Co-Inference Systems
by: Zhou, Ao, et al.
Published: (2024)
PLMM: Personal Large Language Models on Mobile Devices
by: Gong, Yuanhao
Published: (2023)
Fast-PGM: Fast Probabilistic Graphical Model Learning and Inference
by: Jiang, Jiantong, et al.
Published: (2024)
A Rolling Stone Gathers No Moss: Adaptive Policy Optimization for Stable Self-Evaluation in Large Multimodal Models
by: Wang, Wenkai, et al.
Published: (2025)
Collaboration of Large Language Models and Small Recommendation Models for Device-Cloud Recommendation
by: Lv, Zheqi, et al.
Published: (2025)
Making Language Models Better Tool Learners with Execution Feedback
by: Qiao, Shuofei, et al.
Published: (2023)
I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models
by: Hu, Xing, et al.
Published: (2024)
A Novel Hat-Shaped Device-Cloud Collaborative Inference Framework for Large Language Models
by: Xie, Zuan, et al.
Published: (2025)
WebLLM: A High-Performance In-Browser LLM Inference Engine
by: Ruan, Charlie F., et al.
Published: (2024)
Memory-Efficient Backpropagation for Fine-Tuning LLMs on Resource-Constrained Mobile Devices
by: Song, Congzheng, et al.
Published: (2025)
GQSA: Group Quantization and Sparsity for Accelerating Large Language Model Inference
by: Zeng, Chao, et al.
Published: (2024)
Understanding Large Language Models in Your Pockets: Performance Study on COTS Mobile Devices
by: Xiao, Jie, et al.
Published: (2024)
LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators
by: Chitty-Venkata, Krishna Teja, et al.
Published: (2024)