TwELL Format Boosts LLM Performance with 20% Speedup and Reduced Memory Usage on GPUs
May 9, 2026
The work introduces two main contributions: TwELL, a sparse packing format optimized for tiled matrix-multiplication kernels, and custom CUDA kernels that fuse multiple sparse operations and compress activations into TwELL to minimize activation sizes, boosting throughput.
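The article includes no code, but a minimal sketch helps show why a tile-wise ELLPACK layout suits tiled matrix-multiplication kernels: every row in a tile is padded to the same width, so all threads covering that tile execute a uniform inner loop. Everything below is an illustrative assumption rather than the paper's kernel: the names (twell_spmm, ell_val, ell_col, tile_width, tile_off), the tile height TILE_ROWS, and the one-thread-per-output mapping are hypothetical, and the operator fusion and activation compression described above are omitted.

```cuda
#include <cuda_runtime.h>

constexpr int TILE_ROWS = 8;  // rows per tile (hypothetical choice)

// Sketch of a tile-wise ELLPACK SpMM: Y = W_sparse * X.
// Each row tile stores a fixed padded width, so threads over the same
// tile share a loop trip count -- the regularity that makes ELL-style
// formats friendly to GPU warps.
__global__ void twell_spmm(const float* __restrict__ ell_val,
                           const int*   __restrict__ ell_col,
                           const int*   __restrict__ tile_width,
                           const long*  __restrict__ tile_off,
                           const float* __restrict__ x,   // dense K x N
                           float*       __restrict__ y,   // dense M x N
                           int M, int N) {
    int t      = blockIdx.y;                    // row-tile index
    int rlocal = threadIdx.y;                   // row within the tile
    int row    = t * TILE_ROWS + rlocal;
    int col    = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M || col >= N) return;

    int  w    = tile_width[t];                  // padded nnz per row in tile t
    long base = tile_off[t] + (long)rlocal * w; // start of this row's slab
    float acc = 0.f;
    for (int k = 0; k < w; ++k) {               // uniform trip count per tile
        float v = ell_val[base + k];            // padded slots hold 0.0f
        int   c = ell_col[base + k];            // ...with column index 0
        acc += v * x[(long)c * N + col];
    }
    y[(long)row * N + col] = acc;
}
// Launch (sketch): dim3 block(32, TILE_ROWS);
//                  dim3 grid((N + 31) / 32, (M + TILE_ROWS - 1) / TILE_ROWS);
```

The uniform padded width is the point: every thread in a tile runs the same number of iterations, avoiding the warp divergence and irregular indexing that typically make CSR-style sparse kernels slow on GPUs.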
Benchmark results on billion-parameter-scale models show more than a 20% speedup, along with reduced peak memory usage and lower energy consumption.
The approach promises improvements for both LLM inference and training, potentially enabling smaller, more accessible models and shaping future optimization strategies for sparse transformer architectures.
The work is a collaboration between Sakana AI and NVIDIA that addresses the hardware mismatches that make sparsity slow for large language models on GPUs, reshaping sparsity patterns to fit hardware constraints.
The article cites the TwELL technical paper (arXiv:2603.23198) and notes related perspectives on optimizing LLM inference and training, drawing additional context from StartupHub.ai's coverage.
TwELL (Tile-wise ELLPACK) is a hybrid sparse data format that routes 99% of sparse tokens through a fast path and falls back to a dense backup matrix for the remainder, enabling more efficient GPU processing.
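As a rough sketch of that hybrid split (again hypothetical: pack_twell, TwellPack, the coverage-quantile heuristic, and the dense M x K backup layout are illustrative assumptions, not the published algorithm), a host-side packer might choose each tile's padded width so that nearly all nonzeros fit the fast ELL path, spilling the few overflow entries into the dense backup:

```cuda
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr int TILE_ROWS = 8;  // must match the kernel sketch above

struct TwellPack {
    std::vector<float> ell_val;     // padded per-tile slabs (fast path)
    std::vector<int>   ell_col;     // matching column indices, 0 for padding
    std::vector<int>   tile_width;  // padded nnz width chosen per tile
    std::vector<long>  tile_off;    // start of each tile's slab
    std::vector<float> backup;      // dense M x K fallback for overflow
};

TwellPack pack_twell(const std::vector<float>& dense, int M, int K,
                     float coverage = 0.99f) {
    TwellPack p;
    p.backup.assign((size_t)M * K, 0.f);
    int num_tiles = (M + TILE_ROWS - 1) / TILE_ROWS;
    for (int t = 0; t < num_tiles; ++t) {
        // Count nonzeros per row of this tile.
        std::vector<int> nnz(TILE_ROWS, 0);
        for (int r = 0; r < TILE_ROWS; ++r) {
            int row = t * TILE_ROWS + r;
            if (row >= M) break;
            for (int k = 0; k < K; ++k)
                if (dense[(size_t)row * K + k] != 0.f) ++nnz[r];
        }
        // Width = coverage quantile of row counts: a few heavy rows
        // overflow to the backup instead of padding every row to the max.
        std::vector<int> sorted = nnz;
        std::sort(sorted.begin(), sorted.end());
        int width = sorted[(int)(coverage * (TILE_ROWS - 1))];

        p.tile_off.push_back((long)p.ell_val.size());
        p.tile_width.push_back(width);
        for (int r = 0; r < TILE_ROWS; ++r) {
            int row = t * TILE_ROWS + r;
            int filled = 0;
            for (int k = 0; row < M && k < K; ++k) {
                float v = dense[(size_t)row * K + k];
                if (v == 0.f) continue;
                if (filled < width) {              // fast ELL path
                    p.ell_val.push_back(v);
                    p.ell_col.push_back(k);
                    ++filled;
                } else {                           // overflow -> dense backup
                    p.backup[(size_t)row * K + k] = v;
                }
            }
            for (; filled < width; ++filled) {     // zero-pad to the slab width
                p.ell_val.push_back(0.f);
                p.ell_col.push_back(0);
            }
        }
    }
    return p;
}
```

A second pass would then apply an ordinary dense matmul to the backup matrix (or skip it when empty), so the common case stays on the fast, regular ELL path.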
Source
StartupHub.ai, "Faster LLMs by Reshaping Sparsity," May 9, 2026