AutoKernel Boosts GPU Optimization for PyTorch Models with Overnight Speedups and Rigorous Testing

April 6, 2026
  • AutoKernel enables rapid, overnight optimization cycles that yield end-to-end speedups on real models by allocating exploration effort according to each kernel’s share of total runtime, rather than optimizing isolated kernels.

  • The framework supports both Triton and CUDA C++ backends and covers nine kernel types, including matmul, softmax, layernorm, rmsnorm, cross_entropy, and rotary_embedding, with a common kernel_fn() interface for consistency.
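The common `kernel_fn()` interface is named in the source, but its exact signature is not; the sketch below assumes a simple convention in which every kernel module, regardless of backend, exposes a `kernel_fn()` entry point, illustrated with a pure-Python softmax stand-in rather than a real Triton or CUDA kernel.

```python
import math

def softmax_kernel_fn(x):
    """Pure-Python stand-in for a softmax kernel_fn().

    Assumed convention (not specified in the source): every kernel
    module, whether backed by Triton or CUDA C++, exposes a
    kernel_fn() that takes input data and returns the output, so the
    harness can benchmark and test all nine kernel types uniformly.
    """
    m = max(x)                              # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

# A harness can then treat every kernel type through one interface:
KERNEL_REGISTRY = {"softmax": softmax_kernel_fn}

out = KERNEL_REGISTRY["softmax"]([1.0, 2.0, 3.0])
```

The single entry point is what lets one benchmarking and correctness harness drive all nine kernel types without per-kernel glue code.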

  • The project is accompanied by a full evaluation paper and a public repository, with community demonstrations such as a Triton FP4 kernel outperforming specialized implementations in certain configurations.

  • Benchmarks on NVIDIA H100 show significant gains for memory-bound kernels, with RMSNorm, softmax, and cross-entropy achieving notable throughput improvements over eager execution and torch.compile, driven by fused multi-operation ATen decompositions that reduce memory traffic.

  • A five-stage correctness harness ensures safety before recording speedups: smoke tests, shape/config sweeps across 10+ configurations, numerical stability tests, determinism checks, and non-power-of-two edge case tests, with dtype-specific tolerances for FP16, BF16, and FP32.
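The dtype-specific tolerance check at the heart of that harness can be sketched as follows. The source confirms that FP16, BF16, and FP32 get different tolerances but does not give the values; the numbers below are assumptions chosen to match common practice for half, bfloat16, and float comparisons.

```python
# Dtype-specific tolerances (assumed values, not from the source):
# looser bounds for the lower-precision formats.
TOLERANCES = {
    "fp32": {"rtol": 1e-5, "atol": 1e-6},
    "fp16": {"rtol": 1e-3, "atol": 1e-4},
    "bf16": {"rtol": 1e-2, "atol": 1e-3},
}

def allclose(candidate, reference, dtype):
    """Element-wise |c - r| <= atol + rtol * |r|, the usual allclose rule."""
    tol = TOLERANCES[dtype]
    return all(
        abs(c - r) <= tol["atol"] + tol["rtol"] * abs(r)
        for c, r in zip(candidate, reference)
    )

# A drift of 1e-4 passes under FP16 tolerances but fails under FP32:
ok_fp16 = allclose([1.0001, 2.0002], [1.0, 2.0], "fp16")
ok_fp32 = allclose([1.0001], [1.0], "fp32")
```

A candidate kernel only counts as a speedup once checks like this pass for every configuration in the sweep.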

  • The core idea mirrors an expert kernel engineer’s workflow: write a candidate kernel, benchmark it, keep improvements, discard regressions, and repeat, with every experiment captured as a git commit and results logged in a human-readable results.tsv file.
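The keep-improvements, discard-regressions loop above can be sketched as a few lines of Python. The benchmark results here are simulated, the `results.tsv` column layout is an assumption (the source only says the file is human-readable TSV), and the per-experiment git commit is elided.

```python
import io

def run_experiments(candidates, baseline_ms):
    """Keep a candidate only if it beats the best time so far;
    log every attempt to a results.tsv-style table either way."""
    log = io.StringIO()
    log.write("experiment\ttime_ms\tspeedup_vs_baseline\tkept\n")  # assumed columns
    best_ms = baseline_ms
    for name, time_ms in candidates:
        kept = time_ms < best_ms            # improvement -> keep, regression -> discard
        if kept:
            best_ms = time_ms               # (in AutoKernel, also a git commit)
        log.write(f"{name}\t{time_ms:.2f}\t{baseline_ms / time_ms:.2f}x\t{kept}\n")
    return best_ms, log.getvalue()

# Simulated benchmark results (ms) for three hypothetical candidate kernels:
best, tsv = run_experiments(
    [("tile_64", 0.90), ("tile_128", 1.10), ("vectorized", 0.75)],
    baseline_ms=1.20,
)
```

Logging regressions as well as wins is what makes the TSV a complete, reproducible record of the search rather than just a leaderboard.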

  • AutoKernel's performance is competitive with, and in some cases superior to, TorchInductor’s general fusion and autotuning, though matmul remains a challenging area where cuBLAS is highly optimized and further improvements are targeted.

  • AutoKernel is an open-source framework from RightNow AI that uses an autonomous LLM agent loop to automate GPU kernel optimization for arbitrary PyTorch models, delivering faster Triton and CUDA kernels without requiring GPU expertise.

  • Each optimization iteration takes about 90 seconds, enabling roughly 300 to 400 experiments per 10-hour overnight run, and the process draws on a six-tier optimization playbook encoded in program.md to guide the agent.
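The experiment budget follows directly from the per-iteration time; a quick check of the arithmetic:

```python
run_hours = 10
iteration_s = 90                                  # ~90 s per optimization iteration
max_experiments = run_hours * 3600 // iteration_s  # 36000 s / 90 s
# 400 iterations at the upper bound; the quoted 300-400 range
# allows for slower iterations and harness overhead.
```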

  • AutoKernel starts from a full PyTorch model, profiles per-kernel GPU time using torch.profiler, and prioritizes targets via Amdahl’s law to maximize end-to-end speedups rather than focusing solely on individual kernels.
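The Amdahl's-law prioritization above can be made concrete: if a kernel accounts for fraction p of total GPU time and a candidate makes it s times faster, the end-to-end speedup is 1 / ((1 - p) + p / s), so exploration effort goes to kernels with the largest p. The profile fractions below are made up for illustration, not taken from the source's benchmarks.

```python
def end_to_end_speedup(p, s):
    """Amdahl's law: overall speedup when a fraction p of the
    runtime is accelerated by a factor s."""
    return 1.0 / ((1.0 - p) + p / s)

# Hypothetical torch.profiler breakdown: kernel -> share of GPU time.
profile = {"matmul": 0.55, "softmax": 0.20, "layernorm": 0.10}

# Ceiling per kernel if it were made infinitely fast (s -> inf),
# i.e. 1 / (1 - p): this bound is what drives target prioritization.
ceilings = {k: 1.0 / (1.0 - p) for k, p in profile.items()}
targets = sorted(ceilings, key=ceilings.get, reverse=True)
```

Under these numbers, even an infinite softmax speedup caps the model at 1.25x end to end, while matmul's 55% share offers up to 2.2x, which is why per-kernel wins alone do not translate into model-level gains.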

  • Overall, AutoKernel combines profiling, Amdahl-based prioritization, and a six-tier playbook to drive end-to-end speedups while maintaining safety through rigorous testing and reproducible logging.

Summary based on 1 source

