FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | Paper Notes
8/28/2025
https://arxiv.org/abs/2205.14135
Basically, it splits the inputs Q, K, and V into blocks, loads them from slow HBM to fast SRAM, computes the attention output for each block there, and then rescales and accumulates the partial results so the final output is the exact attention result.
FlashAttention is an algorithm designed to make the attention mechanism in Transformers faster and more memory-efficient without sacrificing accuracy. Standard attention is often memory-bound because reading and writing large intermediate attention matrices to and from the GPU's High Bandwidth Memory (HBM) is slow. FlashAttention restructures the computation using tiling and recomputation.
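To make the memory issue concrete, here is a minimal NumPy sketch of standard attention (my own illustration, not code from the paper; `naive_attention` is a name I chose). The N × N matrices `S` and `P` are the intermediates that a standard implementation writes to and reads back from HBM.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard attention: materializes the full N x N score matrix.

    For sequence length N and head dimension d, S and P are N x N --
    these are the large intermediates that standard implementations
    read from and write to HBM.
    """
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)                       # (N, N) scores
    P = np.exp(S - S.max(axis=-1, keepdims=True))  # stabilized exponentials
    P /= P.sum(axis=-1, keepdims=True)             # softmax over keys
    return P @ V                                   # (N, d) output
```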
It splits the input matrices into smaller blocks (tiling) that can fit into the GPU's much faster on-chip SRAM, and performs all computation for those blocks within this fast memory. To avoid storing the large attention matrix for the backward pass, it instead stores the small softmax normalization statistics (the row-wise max and sum) and recomputes the attention matrix on-chip during backprop. This reduces the number of slow memory accesses.
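Here is a sketch of the tiling plus online-softmax idea in the same NumPy setting. This is only illustrative: the names (`tiled_attention`, `block_size`, the running statistics `m` and `l`) are mine, only the K/V loop is tiled here, and a real FlashAttention kernel also tiles Q and keeps each block in SRAM rather than in NumPy arrays. The point it shows is how per-block results can be rescaled and accumulated into the exact softmax output.

```python
def tiled_attention(Q, K, V, block_size=128):
    """Process K/V in blocks, keeping a running row-max m and row-sum l
    so earlier partial outputs can be rescaled as new blocks arrive."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((N, d))
    m = np.full((N, 1), -np.inf)   # running max of scores per query row
    l = np.zeros((N, 1))           # running sum of exp(scores - m)

    for start in range(0, N, block_size):
        Kb = K[start:start + block_size]
        Vb = V[start:start + block_size]
        S = Q @ Kb.T * scale                              # scores for this block
        m_new = np.maximum(m, S.max(axis=-1, keepdims=True))
        P = np.exp(S - m_new)                             # block's unnormalized probs
        correction = np.exp(m - m_new)                    # rescale old accumulators
        l = l * correction + P.sum(axis=-1, keepdims=True)
        O = O * correction + P @ Vb
        m = m_new

    return O / l   # final normalization gives the exact softmax output
```

A quick check that the blockwise accumulation reproduces the reference result:

```python
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((512, 64)) for _ in range(3))
assert np.allclose(naive_attention(Q, K, V), tiled_attention(Q, K, V))
```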