Fast Transformer Decoding: One Write-Head is All You Need | Paper Notes
These are my notes and thoughts, jotted down for future reference. They may be outdated, inaccurate, or completely useless.
9/10/2025
https://arxiv.org/abs/1911.02150
Introduces multi-query attention.
MQA is a variant of multi-head attention (MHA) in which all heads share a single set of keys and values; each head still has its own query projection.
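A minimal sketch of the idea, assuming NumPy and my own dimension names (batch b, heads h, sequence n, per-head dim d); this is not the paper's code, just an illustration of the shared key/value tensors:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_query_attention(Q, K, V):
    """Q: [b, h, n, d] -- one query projection per head.
    K, V: [b, n, d]    -- a single shared key/value projection for all heads.
    Returns: [b, h, n, d]."""
    d = Q.shape[-1]
    # Every head attends over the same shared keys...
    logits = np.einsum("bhnd,bmd->bhnm", Q, K) / np.sqrt(d)
    weights = softmax(logits, axis=-1)
    # ...and mixes the same shared values.
    return np.einsum("bhnm,bmd->bhnd", weights, V)

b, h, n, d = 2, 8, 16, 64
Q = np.random.randn(b, h, n, d)
K = np.random.randn(b, n, d)  # no head axis: one "write head"
V = np.random.randn(b, n, d)
print(multi_query_attention(Q, K, V).shape)  # (2, 8, 16, 64)
```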
They evaluate on the WMT 2014 English-German translation task, comparing a baseline MHA model with MQA. Perplexity was slightly worse for MQA, but the BLEU score improved. Results on the billion-word language modeling benchmark were similar.
The most significant result is the dramatic improvement in incremental decoding speed: the MQA model decoded about 12x faster than the MHA baseline.
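The speedup comes from incremental decoding being memory-bandwidth bound: each step reloads the cached keys and values, and sharing them across heads shrinks that cache by roughly the head count. A back-of-the-envelope illustration with assumed (not paper-specific) dimensions:

```python
# Illustrative KV-cache arithmetic for incremental decoding.
# All numbers below are assumptions for the example, not from the paper.
n_layers, n_heads, d_head, seq_len, bytes_per = 6, 8, 64, 1024, 2

# MHA caches K and V per head; MQA caches one shared K and V per layer.
mha_cache = n_layers * 2 * n_heads * seq_len * d_head * bytes_per
mqa_cache = n_layers * 2 * 1 * seq_len * d_head * bytes_per

print(f"MHA KV cache: {mha_cache / 1e6:.1f} MB per sequence")
print(f"MQA KV cache: {mqa_cache / 1e6:.1f} MB per sequence")
print(f"reduction factor: {mha_cache / mqa_cache:.0f}x")  # equals n_heads
```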