On Apple devices, AI is showing remarkable potential. According to recent research from Gimlet Labs, AI models can automatically generate optimized Metal kernels that speed up PyTorch inference by an average of 87% (a 1.87x speedup) across 215 tested PyTorch modules, with some workloads running hundreds of times faster.
Researchers selected eight AI models from several top labs, including Anthropic, DeepSeek, and OpenAI, and used them to generate optimized GPU kernels for Apple devices. The process requires no changes to user code and no new framework; it directly improves model performance on Apple hardware.
In the experiment, the research team tested on a Mac Studio equipped with an Apple M4 Max chip, with PyTorch's eager mode as the baseline. The benchmark used 215 PyTorch modules from the KernelBench dataset, divided into three categories ranging from simple matrix multiplications to full model architectures.
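The speedup figures reported against the eager-mode baseline amount to a ratio of wall-clock latencies. A minimal, framework-agnostic sketch of such a measurement (the helper names here are our own illustration, not the authors' actual harness):

```python
import statistics
import time

def measure_latency(fn, warmup=3, runs=10):
    # Discard warmup iterations, then take the median of timed runs,
    # which is more robust to outliers than the mean.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

def speedup(baseline_fn, optimized_fn):
    # Speedup = baseline latency / optimized latency;
    # e.g. 1.87 corresponds to "87% faster than eager mode".
    return measure_latency(baseline_fn) / measure_latency(optimized_fn)
```

In the study, `baseline_fn` would run the module under PyTorch eager mode and `optimized_fn` would run it with the AI-generated Metal kernel swapped in.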
The pipeline receives the input and the PyTorch code, generates a Metal kernel, and evaluates its correctness. The data showed that the correctness of AI-generated kernels improved as the number of attempts increased: by the fifth attempt, 94% of kernels were implemented correctly. The models also showed capability across problem levels, and even non-reasoning models could sometimes generate effective kernels.
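The generate-verify-retry loop described above can be sketched as follows. The model call and the correctness check are simulated stubs (hypothetical names of our own; the real system would call an LLM and execute the kernel against the reference module):

```python
import random

def generate_kernel(module_src, attempt, rng):
    # Hypothetical stand-in for an LLM call that emits Metal source.
    # Here we simulate a fixed per-attempt success probability.
    return {"src": f"// attempt {attempt}", "correct": rng.random() < 0.45}

def is_correct(kernel, module_src):
    # Stand-in for running the kernel and comparing its outputs to the
    # reference PyTorch module on random inputs, within tolerance.
    return kernel["correct"]

def generate_with_retries(module_src, max_attempts=5, seed=0):
    # Generate, verify, and retry up to max_attempts times, returning
    # the first kernel that passes verification (or None if all fail).
    rng = random.Random(seed)
    for attempt in range(1, max_attempts + 1):
        kernel = generate_kernel(module_src, attempt, rng)
        if is_correct(kernel, module_src):
            return kernel
    return None
```

Even with a modest per-attempt success rate, a few retries drive the overall correctness rate up sharply, which matches the reported jump to 94% by attempt five.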
The results showed that GPT-5 achieved a 4.65x speedup on certain tasks, and more strikingly, the o3 model reduced latency by a factor of 9,000 in some cases. The study also found that no single model performed best on every task; combining candidates from multiple models produced better kernels.
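The multi-model combination idea reduces to a best-of-n selection: collect candidate kernels from several models, discard the incorrect ones, and keep the fastest survivor. A minimal sketch (field names are illustrative assumptions):

```python
def best_kernel(candidates):
    # Keep only candidates that passed the correctness check, then
    # pick the one with the lowest measured latency.
    correct = [c for c in candidates if c["correct"]]
    return min(correct, key=lambda c: c["latency_ms"]) if correct else None

candidates = [
    {"model": "model-a", "correct": True,  "latency_ms": 5.2},
    {"model": "model-b", "correct": False, "latency_ms": 0.9},
    {"model": "model-c", "correct": True,  "latency_ms": 3.1},
]
print(best_kernel(candidates)["model"])  # model-c
```

This is why an ensemble can beat any single model: each task only needs one model to produce a fast, correct kernel.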
To further improve performance, the researchers added extra context to the generation process, such as CUDA implementations and gputrace performance profiles. With this additional context the average speedup reached 1.87x, roughly three times the incremental gain of the 1.31x achieved by the plain agent.
The researchers emphasized that this work is not meant to demonstrate the ultimate performance ceiling, but to verify the feasibility of AI-driven kernel generation and to reduce developer burden through automation. Overall, the study marks a notable advance for AI in hardware optimization.
GitHub: https://github.com/ScalingIntelligence/KernelBench/
Key Points:
🌟 AI automatically generates Metal kernels, speeding up PyTorch inference by an average of 87%.
⚡️ Achieves an average 1.87x speedup across 215 PyTorch modules, with some workloads hundreds of times faster.
🔍 The study aims to verify the feasibility of AI for kernel generation, advancing automated hardware optimization.