Bibliographic Details
Main Authors: Sun, Qitong; Han, Jun; Li, Tianlin; Tang, Zhe; Chen, Sheng; Yang, Fei; Liu, Aishan; Liu, Xianglong; Liu, Yang
Format: Preprint
Published: 2026
Subjects: Machine Learning; Artificial Intelligence; Multiagent Systems
Online Access: https://arxiv.org/abs/2603.10085
author Sun, Qitong
Han, Jun
Li, Tianlin
Tang, Zhe
Chen, Sheng
Yang, Fei
Liu, Aishan
Liu, Xianglong
Liu, Yang
contents Improving GPU kernel efficiency is crucial for advancing AI systems. Recent work has explored leveraging large language models (LLMs) for GPU kernel generation and optimization. However, existing LLM-based kernel optimization pipelines typically rely on opaque, implicitly learned heuristics within the LLMs to determine optimization strategies. This leads to inefficient trial-and-error and weakly interpretable optimizations. Our key insight is to replace implicit heuristics with expert optimization skills that are knowledge-driven and aware of task trajectories. Specifically, we present KernelSkill, a multi-agent framework with a dual-level memory architecture. KernelSkill operates by coordinating agents with long-term memory of reusable expert skills and short-term memory to prevent repetitive backtracking. On KernelBench Levels 1-3, KernelSkill achieves a 100% success rate and average speedups of 5.44x, 2.82x, and 1.92x over Torch Eager on Levels 1, 2, and 3, respectively, outperforming prior baselines. Code is available at https://github.com/0satan0/KernelMem/.
format Preprint
id arxiv_https___arxiv_org_abs_2603_10085
institution arXiv
publishDate 2026
record_format arxiv
title KernelSkill: A Multi-Agent Framework for GPU Kernel Optimization
topic Machine Learning
Artificial Intelligence
Multiagent Systems
url https://arxiv.org/abs/2603.10085