Abstract
This paper introduces DR. KERNEL, a reinforcement learning (RL) framework for generating optimized GPU kernels in Triton, a high-level GPU programming language. The authors address key challenges in RL-based kernel generation, such as reward hacking (where models exploit loopholes in the reward system) and lazy optimization (where models settle for trivial but inefficient solutions). To tackle these issues, they build KERNELGYM, a robust distributed GPU environment that supports multi-turn RL training, reward-hacking detection, and data collection. They also propose Turn-level Reinforce-Leave-One-Out (TRLOO), an unbiased advantage estimation method for multi-turn RL, and introduce Profiling-based Rewards (PR) and Profiling-based Rejection Sampling (PRS) to improve training stability and prevent trivial optimizations. The resulting model, DR. KERNEL-14B, is competitive with state-of-the-art models such as Claude-4.5-Sonnet and GPT-5 on the KernelBench benchmark, producing kernels with substantial speedups over the Torch reference. The paper also explores sequential test-time scaling, which further improves performance. All resources, including the environment, training code, models, and datasets, are publicly available.
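For context, Triton lets developers write GPU kernels in Python-like code that is compiled for the GPU. The toy vector-add kernel below is purely illustrative of the kind of artifact such a system generates and optimizes; it is not taken from the paper:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the tensors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Launch enough program instances to cover all elements.
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```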
Methodology
The authors developed KERNELGYM, a distributed GPU environment for RL training that provides reward-hacking checks, multi-turn interaction support, and infrastructure for long-running RL training. They identified that GRPO's advantage estimation yields a biased policy gradient in the multi-turn setting and proposed Turn-level Reinforce-Leave-One-Out (TRLOO) as an unbiased alternative. To stabilize training and discourage lazy optimization, they introduced Profiling-based Rewards (PR) and Profiling-based Rejection Sampling (PRS). The model was trained on Triton kernel generation tasks with these methods and evaluated on the KernelBench benchmark.
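The summary does not spell out the TRLOO formula. As a rough sketch, a leave-one-out baseline scores each rollout against the mean reward of the other rollouts in its group, applied here per turn rather than once per trajectory; because the baseline never depends on a rollout's own sample, the resulting gradient estimate stays unbiased. The function below is a hypothetical illustration of that idea, not the authors' implementation:

```python
import numpy as np


def turn_level_loo_advantages(turn_rewards: np.ndarray) -> np.ndarray:
    """Hypothetical turn-level leave-one-out advantages.

    turn_rewards: shape (G, T), the reward each of G grouped rollouts
    receives at each of T turns (the paper's exact reward layout may differ).
    """
    G = turn_rewards.shape[0]
    # Baseline for rollout i at turn t: mean reward of the other G-1 rollouts
    # at that turn.
    totals = turn_rewards.sum(axis=0, keepdims=True)   # (1, T)
    loo_baseline = (totals - turn_rewards) / (G - 1)   # (G, T)
    return turn_rewards - loo_baseline
```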
Results
DR. KERNEL-14B matched or exceeded state-of-the-art models, surpassing Claude-4.5-Sonnet and GPT-5 on the KernelBench Level-2 subset. Specifically, 31.6% of the generated kernels achieved at least a 1.2× speedup over the Torch reference, compared with 26.7% for Claude-4.5-Sonnet and 28.6% for GPT-5. With sequential test-time scaling, the 1.2× speedup rate increased to 47.8%.
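Speedup here means the wall-clock latency of the Torch reference divided by that of the generated kernel. A minimal sketch of such a measurement using Triton's built-in timer is shown below; `torch_fn` and `triton_fn` are hypothetical placeholders for the reference and the generated kernel, and the paper's actual profiling harness may differ (for example, it would also verify numerical correctness before timing):

```python
from triton.testing import do_bench


def measure_speedup(torch_fn, triton_fn, *args) -> float:
    """Latency ratio of the Torch reference to the candidate Triton kernel."""
    ref_ms = do_bench(lambda: torch_fn(*args))
    cand_ms = do_bench(lambda: triton_fn(*args))
    # A ratio of at least 1.2 would count toward the 1.2x speedup rates above.
    return ref_ms / cand_ms
```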
Implications
The proposed framework has significant implications for automating GPU kernel optimization, reducing the need for manual engineering expertise, and improving the efficiency of large-scale AI systems. By addressing key challenges in RL-based kernel generation, this work paves the way for more effective and scalable AI-driven code optimization in high-performance computing.