Bibliographic Details
Main Authors: Guo, Hui; Zheng, Qihang; Huo, Chenghai; Guo, Dongliang; Yang, Haoqi; Zhang, Yang
Format: Preprint
Published: 2025
Subjects: Distributed, Parallel, and Cluster Computing; Machine Learning
Online Access: https://arxiv.org/abs/2512.21571
author Guo, Hui
Zheng, Qihang
Huo, Chenghai
Guo, Dongliang
Yang, Haoqi
Zhang, Yang
contents The efficient deployment of large language models (LLMs) is hindered by memory architecture heterogeneity, where traditional compilers suffer from fragmented workflows and high adaptation costs. We present nncase, an open-source, end-to-end compilation framework designed to unify optimization across diverse targets. Central to nncase is an e-graph-based term rewriting engine that mitigates the phase ordering problem, enabling global exploration of computation and data movement strategies. The framework integrates three key modules: Auto Vectorize for adapting to heterogeneous computing units, Auto Distribution for searching parallel strategies with cost-aware communication optimization, and Auto Schedule for maximizing on-chip cache locality. Furthermore, a buffer-aware Codegen phase ensures efficient kernel instantiation. Evaluations show that nncase outperforms mainstream frameworks like MLC LLM and Intel IPEX on Qwen3 series models and achieves performance comparable to the hand-optimized llama.cpp on CPUs, demonstrating the viability of automated compilation for high-performance LLM deployment. The source code is available at https://github.com/kendryte/nncase.
format Preprint
id arxiv_https___arxiv_org_abs_2512_21571
institution arXiv
publishDate 2025
record_format arxiv
title nncase: An End-to-End Compiler for Efficient LLM Deployment on Heterogeneous Storage Architectures
topic Distributed, Parallel, and Cluster Computing
Machine Learning
url https://arxiv.org/abs/2512.21571