Fast Transformer Decoding: One Write-Head is All You Need | Paper Notes
These are my notes and thoughts, jotted down for future reference. They may be outdated, inaccurate, or completely useless.
9/10/2025
https://arxiv.org/abs/1911.02150
Introduces multi-query attention.
MQA is a variant of multi-head attention (MHA) in which all heads share a single set of keys and values; each head still has its own query projection.
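A minimal sketch of the idea, assuming NumPy and my own dimension names (batch b, heads h, sequence n, per-head dim d); this is not the paper's code, just an illustration of the shared key/value tensors:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_query_attention(Q, K, V):
    """Q: [b, h, n, d] -- one query projection per head.
    K, V: [b, n, d]    -- a single shared key/value projection for all heads.
    Returns: [b, h, n, d]."""
    d = Q.shape[-1]
    # Every head attends over the same shared keys...
    logits = np.einsum("bhnd,bmd->bhnm", Q, K) / np.sqrt(d)
    weights = softmax(logits, axis=-1)
    # ...and mixes the same shared values.
    return np.einsum("bhnm,bmd->bhnd", weights, V)

b, h, n, d = 2, 8, 16, 64
Q = np.random.randn(b, h, n, d)
K = np.random.randn(b, n, d)  # no head axis: one "write head"
V = np.random.randn(b, n, d)
print(multi_query_attention(Q, K, V).shape)  # (2, 8, 16, 64)
```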
They evaluate on the WMT 2014 English-German translation task, comparing a baseline MHA model with MQA. Perplexity was slightly worse for MQA, but the BLEU score improved. Results on the billion-word language modeling benchmark were similar.
The most significant result is the dramatic improvement in incremental decoding speed: the MQA model decoded about 12x faster than the MHA baseline.
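The speedup comes from incremental decoding being memory-bandwidth bound: each step reloads the cached keys and values, and sharing them across heads shrinks that cache by roughly the head count. A back-of-the-envelope illustration with assumed (not paper-specific) dimensions:

```python
# Illustrative KV-cache arithmetic for incremental decoding.
# All numbers below are assumptions for the example, not from the paper.
n_layers, n_heads, d_head, seq_len, bytes_per = 6, 8, 64, 1024, 2

# MHA caches K and V per head; MQA caches one shared K and V per layer.
mha_cache = n_layers * 2 * n_heads * seq_len * d_head * bytes_per
mqa_cache = n_layers * 2 * 1 * seq_len * d_head * bytes_per

print(f"MHA KV cache: {mha_cache / 1e6:.1f} MB per sequence")
print(f"MQA KV cache: {mqa_cache / 1e6:.1f} MB per sequence")
print(f"reduction factor: {mha_cache / mqa_cache:.0f}x")  # equals n_heads
```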