TwELL Format Boosts LLM Performance with 20% Speedup and Reduced Memory Usage on GPUs

May 9, 2026
  • The work introduces two main contributions: TwELL, a sparse packing format designed for tiled matrix-multiplication kernels, and custom CUDA kernels that fuse multiple sparse operations and compress the TwELL representation to minimize activation sizes, boosting throughput.

  • Benchmarks on billion-parameter-scale models show a speedup of more than 20%, along with reduced peak memory usage and lower energy consumption.

  • The approach promises improvements for both LLM inference and training, potentially enabling smaller, more accessible models and shaping future optimization strategies for sparse transformer architectures.

  • A collaboration between Sakana AI and NVIDIA, the work addresses the hardware mismatches that keep sparsity from speeding up large language models on GPUs by reshaping the sparsity pattern to fit hardware constraints.

  • The article cites a technical paper for TwELL (arXiv:2603.23198) and notes related perspectives on optimizing LLM inference and training, with additional context from StartupHub.ai coverage.

  • TwELL (Tile-wise ELLPACK) is a hybrid sparse data format that routes 99% of sparse tokens through a fast path while keeping a dense backup matrix for the remainder, enabling more efficient GPU processing; a minimal sketch of the idea follows this list.
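To make the hybrid fast-path/backup idea concrete, here is a minimal NumPy sketch of a tile-wise ELLPACK layout with a dense backup matrix. It is a reading of the bullet points above, not the paper's implementation: the tile size, width budget, row-level spill policy, and the names pack_twell and twell_matvec are all illustrative assumptions (the actual format is specified in arXiv:2603.23198).

```python
# Minimal sketch of a tile-wise ELLPACK ("TwELL"-style) layout with a
# dense backup matrix. Tile size, width budget, spill policy, and all
# names here are illustrative assumptions, not the paper's actual format.
import numpy as np

TILE = 4        # rows per tile (assumed; real kernels pick GPU-friendly tiles)
MAX_WIDTH = 3   # max nonzeros per fast-path row (assumed budget)

def pack_twell(A, tile=TILE, max_width=MAX_WIDTH):
    """Pack A into one fixed-width ELL block per row tile, plus a dense backup.

    Each tile stores (values, column indices) padded to its widest kept row;
    rows with more than max_width nonzeros spill whole to the backup matrix,
    so the fast path always reads a small, regular amount of data.
    """
    n_rows, _ = A.shape
    tiles, backup = [], np.zeros_like(A)
    for t0 in range(0, n_rows, tile):
        rows = list(range(t0, min(t0 + tile, n_rows)))
        nnz = []
        for r in rows:
            nz = np.flatnonzero(A[r])
            if nz.size > max_width:
                backup[r] = A[r]   # spill: the fast path sees this row as empty
                nnz.append(0)
            else:
                nnz.append(nz.size)
        width = max(nnz)           # per-tile ELL width, bounded by max_width
        vals = np.zeros((len(rows), width), dtype=A.dtype)
        cols = np.zeros((len(rows), width), dtype=np.int64)
        for i, r in enumerate(rows):
            if nnz[i]:
                nz = np.flatnonzero(A[r])
                vals[i, :nz.size] = A[r, nz]
                cols[i, :nz.size] = nz
        tiles.append((vals, cols))
    return tiles, backup

def twell_matvec(tiles, backup, x):
    """y = A @ x: gather-multiply over the ELL tiles, then add the dense path."""
    fast = [(vals * x[cols]).sum(axis=1) for vals, cols in tiles]
    return np.concatenate(fast) + backup @ x

# Tiny check against a dense reference on a ~70%-sparse matrix.
rng = np.random.default_rng(0)
A = rng.random((8, 8)) * (rng.random((8, 8)) < 0.3)
x = rng.random(8)
tiles, backup = pack_twell(A)
assert np.allclose(twell_matvec(tiles, backup, x), A @ x)
```

On a GPU, the same split would let a regular tiled kernel handle the fixed-width ELL blocks while a standard dense kernel handles the backup matrix, which is presumably where the fused CUDA kernels mentioned above come in.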

Summary based on 1 source

Source

Faster LLMs by Reshaping Sparsity

StartupHub.ai • May 9, 2026
