This article introduces KernelEvolve, Meta's agentic kernel authoring system that autonomously generates and optimizes low-level hardware kernels for diverse AI models and heterogeneous hardware. It addresses the scalability bottleneck of manual kernel tuning by leveraging AI agents, search algorithms, and a feedback loop to significantly improve inference and training throughput.
Meta operates a vast fleet of heterogeneous hardware, including NVIDIA GPUs, AMD GPUs, and custom MTIA silicon. Efficiently utilizing this hardware for diverse and evolving AI models requires highly optimized, chip-specific kernels. The number of unique kernel configurations grows combinatorially, scaling with the product of hardware types, model architectures, and operator types. Manually authoring and optimizing these kernels for each new chip generation and model architecture is intractable for human experts, creating a critical bottleneck in hardware enablement and model iteration cycles.
KernelEvolve addresses these challenges by treating kernel optimization as a structured search problem rather than one-shot code generation. It leverages an agentic AI system to autonomously generate, evaluate, and refine kernel implementations. This system significantly compresses optimization time from weeks to hours and often surpasses human expert performance.
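The generate-evaluate-refine pattern described above can be sketched as a simple evolutionary search loop. This is a hypothetical toy illustration, not Meta's actual implementation: the `generate`, `mutate`, and `benchmark` callables stand in for LLM-based candidate generation, LLM-guided refinement, and on-hardware profiling, and the demo tunes a single synthetic tile-size knob.

```python
import random

def evolutionary_kernel_search(generate, mutate, benchmark,
                               generations=20, population_size=4):
    """Toy generate-evaluate-refine loop (illustrative only)."""
    # Seed: candidate kernels (stubbed by `generate`; in the real system,
    # these would come from an LLM).
    population = [generate() for _ in range(population_size)]
    best, best_latency = None, float("inf")
    for _ in range(generations):
        # Evaluate: benchmark every candidate (here, a synthetic
        # latency model stands in for on-hardware profiling).
        scored = sorted(population, key=benchmark)
        if benchmark(scored[0]) < best_latency:
            best, best_latency = scored[0], benchmark(scored[0])
        # Refine: keep the fastest half, mutate survivors to refill
        # the population for the next round.
        survivors = scored[: max(1, population_size // 2)]
        population = survivors + [
            mutate(random.choice(survivors))
            for _ in range(population_size - len(survivors))
        ]
    return best, best_latency

# Toy demo: tune one tile-size knob whose synthetic "latency"
# is minimized at 64.
random.seed(0)
generate = lambda: {"tile": random.choice([16, 32, 128, 256])}
mutate = lambda c: {"tile": max(8, c["tile"] + random.choice([-8, 8]))}
latency_of = lambda c: abs(c["tile"] - 64)
best, latency = evolutionary_kernel_search(generate, mutate, latency_of)
```

Because the best-so-far candidate is never discarded, the loop converges monotonically; the search only ever improves as more candidates are evaluated.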
Key System Design Principles of KernelEvolve
KernelEvolve demonstrates a powerful pattern for solving complex, combinatorial optimization problems in infrastructure: combine LLM-based code generation with a robust feedback loop and search engine to iteratively converge on optimal solutions, especially for heterogeneous environments where manual tuning is infeasible.
This continuous feedback loop allows KernelEvolve to adapt as hardware and models evolve, sustaining performance across Meta's massive and diverse AI infrastructure. The system also draws on a retrieval-augmented knowledge base that supplies platform-specific documentation to the LLM, enabling it to reason about hardware architectures it was never trained on.
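A retrieval-augmented knowledge base of this kind can be sketched as a ranking step that selects the most relevant platform documents and prepends them to the LLM prompt. The sketch below is an assumption about the general shape of such a system, not KernelEvolve's actual design; the document strings, the `build_kernel_prompt` helper, and the naive token-overlap ranking (where a production system would likely use embedding similarity) are all illustrative.

```python
def build_kernel_prompt(task, knowledge_base, top_k=2):
    """Hypothetical retrieval-augmented prompt assembly (illustrative)."""
    # Rank docs by naive token overlap with the task description.
    task_tokens = set(task.lower().split())
    ranked = sorted(
        knowledge_base,
        key=lambda doc: len(task_tokens & set(doc.lower().split())),
        reverse=True,
    )
    # Prepend the top-ranked platform docs as context for the LLM.
    context = "\n".join(ranked[:top_k])
    return f"Platform documentation:\n{context}\n\nTask: {task}"

# Illustrative knowledge-base entries (invented for this sketch).
docs = [
    "MTIA: vector loads should be 128-byte aligned for full bandwidth.",
    "AMD CDNA3: wavefront size is 64; prefer LDS tiling for GEMM.",
    "NVIDIA Hopper: use TMA for asynchronous global-to-shared copies.",
]
prompt = build_kernel_prompt("Write a GEMM kernel for AMD CDNA3", docs)
```

Grounding the LLM in retrieved documentation like this is what lets a single generation loop target platforms, such as new MTIA silicon, that were underrepresented in the model's training data.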