Understanding Flash Attention: Writing the Algorithm from Scratch in Triton
Why is Flash Attention so fast? Find out how Flash Attention works. Afterward, we'll polish our understanding by writing a GPU kernel of the algorithm in Triton.
I'm a Machine Learning Researcher and Engineer. Here I write posts on the intersection of deep learning theory and high-performance system engineering
Training quantized neural networks involves a fundamental trade-off: how should you divide your compute budget between full-precision pretraining and quantization-aware training?
Read the Article →Why is Flash Attention so fast? Find out how Flash Attention works. Afterward, we'll polish our understanding by writing a GPU kernel of the algorithm in Triton.
It's all about making your models run faster, from flicking a magic “compile” switch to writing your own custom GPU code. In each step, we’ll implement an innocent softmax function, but things are about to get dark by the end.
If all machine learning engineers want one thing, it's faster model training — maybe after good test metrics.