Tin Rabzelj

Fast Transformer Decoding: One Write-Head is All You Need | Paper Notes

9/10/2025

https://arxiv.org/abs/1911.02150

Introduces multi-query attention.

Multi-query attention (MQA) is a variant of multi-head attention (MHA) in which all heads share a single set of keys and values, while each head keeps its own query projection.
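A minimal sketch of the idea in PyTorch, assuming standard (batch, sequence, model-dim) inputs; the module name and projection layout are illustrative, not taken from the paper's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiQueryAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # Queries keep one projection per head...
        self.q_proj = nn.Linear(d_model, d_model)
        # ...but keys and values are projected to a single shared head.
        self.k_proj = nn.Linear(d_model, self.d_head)
        self.v_proj = nn.Linear(d_model, self.d_head)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        # Per-head queries: (b, n_heads, t, d_head)
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        # Shared keys/values: (b, t, d_head)
        k = self.k_proj(x)
        v = self.v_proj(x)
        # Every head attends over the same keys and values.
        scores = torch.einsum("bhqd,bkd->bhqk", q, k) / self.d_head**0.5
        weights = F.softmax(scores, dim=-1)
        out = torch.einsum("bhqk,bkd->bhqd", weights, v)
        out = out.transpose(1, 2).reshape(b, t, -1)
        return self.out_proj(out)
```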

They evaluate on the WMT 2014 English-German translation task, comparing a baseline MHA model with MQA. Perplexity was slightly worse, but the BLEU score improved. Results were similar on the Billion Word language modeling benchmark.

The most significant result is the dramatic improvement in decoding speed: the MQA model was about 12x faster than MHA at incremental (autoregressive) decoding.
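The speedup comes from the much smaller key/value tensors that must be reloaded from memory at every decoding step: sharing them across heads shrinks the KV cache by a factor of the number of heads. A back-of-the-envelope comparison under hypothetical dimensions (not the paper's configuration):

```python
# Illustrative KV cache sizes for MHA vs. MQA with assumed dimensions.
n_layers, n_heads, d_head = 6, 8, 64
batch, seq_len, bytes_per_elem = 32, 1024, 2  # fp16

# MHA caches keys and values per head; MQA caches one shared set per layer.
mha_cache = n_layers * batch * seq_len * 2 * n_heads * d_head * bytes_per_elem
mqa_cache = n_layers * batch * seq_len * 2 * d_head * bytes_per_elem

print(f"MHA KV cache: {mha_cache / 2**20:.0f} MiB")  # 384 MiB
print(f"MQA KV cache: {mqa_cache / 2**20:.0f} MiB")  # 48 MiB (8x smaller)
```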
