Bibliographic Details
Main Authors: Sun, Qitong; Han, Jun; Li, Tianlin; Tang, Zhe; Chen, Sheng; Yang, Fei; Liu, Aishan; Liu, Xianglong; Liu, Yang
Format: Preprint
Published: 2026
Online Access: https://arxiv.org/abs/2603.10085
Summary:
  • Improving GPU kernel efficiency is crucial for advancing AI systems. Recent work has explored leveraging large language models (LLMs) for GPU kernel generation and optimization. However, existing LLM-based kernel optimization pipelines typically rely on opaque, implicitly learned heuristics within the LLMs to determine optimization strategies. This leads to inefficient trial-and-error and weakly interpretable optimizations. Our key insight is to replace implicit heuristics with expert optimization skills that are knowledge-driven and aware of task trajectories. Specifically, we present KernelSkill, a multi-agent framework with a dual-level memory architecture. KernelSkill operates by coordinating agents with long-term memory of reusable expert skills and short-term memory to prevent repetitive backtracking. On KernelBench Levels 1-3, KernelSkill achieves a 100% success rate and average speedups of 5.44x, 2.82x, and 1.92x over Torch Eager on Levels 1, 2, and 3, respectively, outperforming prior baselines. Code is available at https://github.com/0satan0/KernelMem/.