Cutting LLM Memory by 84%: A Deep Dive into Fused Kernels

The Stride Swap: When computing P · W^T, we don’t actually have to physically transpose the huge W matrix in memory. Instead, we swap the shapes and strides in W’s block pointer to read the rows of W as columns of W^T. This results in a “free” transpose that saves both time and VRAM.
Numerical Precision: It’s worth noting that while X and W might be in bfloat16, the accumulation of dW and dX via atomic_add is usually performed in float32 to prevent the build-up of tiny rounding errors across thousands of rows.
Contention Note: While atomic_add is necessary for dW (because every program updates the same weights), dX is private to each program, meaning there is zero contention between program IDs for that specific tensor.
Atomic Add Masking: atomic_add doesn’t support block pointers. Therefore, we implement the pointer and mask logic for dW explicitly.
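
The stride swap from the first note can be illustrated without a GPU. The sketch below is plain Python rather than Triton: it assumes W is stored row-major with shape (V, D), and `make_view` is a hypothetical helper that only mimics how a block pointer turns an index pair into a memory offset. Swapping both the shape and the strides reads the same buffer as W^T, with no copy.

```python
def make_view(shape, strides):
    """Hypothetical helper: map a 2D index (i, j) to an offset in a flat buffer."""
    def offset(i, j):
        assert 0 <= i < shape[0] and 0 <= j < shape[1], "index out of bounds"
        return i * strides[0] + j * strides[1]
    return offset

# Assumption: W has shape (V, D) and is stored row-major in one flat buffer.
V, D = 3, 2
w_flat = [1, 2, 3, 4, 5, 6]  # rows of W: [1, 2], [3, 4], [5, 6]

# Normal view of W: stepping one row skips D elements, one column skips 1.
w_offset = make_view(shape=(V, D), strides=(D, 1))

# Swapped view: same buffer, shape (D, V), strides (1, D). This is W^T.
wt_offset = make_view(shape=(D, V), strides=(1, D))

w = [[w_flat[w_offset(i, j)] for j in range(D)] for i in range(V)]
w_t = [[w_flat[wt_offset(i, j)] for j in range(V)] for i in range(D)]

print(w)    # the original matrix
print(w_t)  # its transpose, read from the same memory
```

No element of `w_flat` is moved or duplicated; only the indexing arithmetic changes, which is why the transpose costs no extra time or VRAM.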


or fine-tuned an LLM, you’ve probably hit a wall on the final step: the Cross-Entropy Loss.

The culprit is the logit bottleneck. To predict the next token, we project a hidden state into a huge vocabulary space. For Llama 3 (128,256 tokens), the weight matrix alone is over 525 million parameters. While that’s only ~1GB in bfloat16, the intermediate logit tensor is the real issue. For large batches, it can easily exceed 80GB of VRAM just to compute a single scalar loss.
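
The numbers above can be checked with a few lines of arithmetic. This is a rough sketch, not a measurement: the hidden size of 4,096 is an assumption (it is the Llama 3 8B value and matches the 525 million figure), and the batch shape is a hypothetical one chosen only for illustration.

```python
VOCAB = 128_256   # Llama 3 vocabulary size (from the text)
HIDDEN = 4_096    # assumption: Llama 3 8B hidden size
BYTES_BF16 = 2
BYTES_FP32 = 4

# Size of the output projection (the weight matrix W).
weight_params = VOCAB * HIDDEN
weight_gb = weight_params * BYTES_BF16 / 1e9

def logits_gb(num_tokens: int, bytes_per_elem: int) -> float:
    """Memory for a fully materialised (num_tokens, VOCAB) logit tensor, in GB."""
    return num_tokens * VOCAB * bytes_per_elem / 1e9

# Hypothetical batch: 32 sequences of 8,192 tokens each.
num_tokens = 32 * 8_192

print(f"weight matrix: {weight_params:,} params, {weight_gb:.2f} GB in bfloat16")
print(f"logits (bfloat16): {logits_gb(num_tokens, BYTES_BF16):.1f} GB")
print(f"logits (float32):  {logits_gb(num_tokens, BYTES_FP32):.1f} GB")
```

The logit tensor scales with the number of tokens times the vocabulary size, so it dwarfs the weight matrix long before the batch becomes unusually large.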

Optimising this layer is how libraries like Unsloth and Liger-Kernel achieve such large memory reductions. In this article, we’ll build a fused Linear + Cross Entropy kernel from scratch in Triton. We’ll derive the math and implement a tiled forward and backward pass that slashes peak memory usage by 84%.

Note on Performance: This implementation is primarily educational. We prioritise mathematical clarity and readable Triton code by using global atomic operations. While it solves the memory bottleneck, matching production-grade speeds would require significantly more complex implementations, which are out of scope for this article.

This post is part of my Triton series. We’ll be using concepts like tiling and online softmax that we’ve covered previously. If these sound unfamiliar, I recommend catching up there first!