Tin Rabzelj
Dashed Line

Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation | Paper Notes

8/27/2025

https://arxiv.org/abs/2108.12409

Introduces the Attention with Linear Biases (ALiBi) technique, which aims to improve an LLM's ability to extrapolate to longer sequences than it was trained on.

ALiBi biases attention scores with a negative penalty that grows linearly with the distance between the relevant key and query. Their approach "eliminates position embeddings."

Training on shorter sequences costs less. Their 1.3B model trained on $L=1024$ tokens with ALiBi achieves the same perplexity as a sinusoidal PE model trained on $L=2048$ when both are tested on sequences of 2048 tokens, while being 11% faster and using 11% less memory.

In the classic approach, position embeddings are added to the word embeddings at the bottom of the network. For an input subsequence of length $L$, the attention sublayer computes the attention scores of the $i$-th query $q_i \in \mathbb{R}^{1\times d}$, $(1 \le i \le L)$, in each head, given the first $i$ keys $K \in \mathbb{R}^{i\times d}$, where $d$ is the head dimension:

$$\text{softmax}(q_i K^{T}).$$

These scores are then multiplied by the values.
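As a rough sketch of that computation (a single head in PyTorch, with illustrative names and shapes; the usual $1/\sqrt{d}$ scaling is left out to match the formula as written above):

```python
import torch

def causal_attention(Q, K, V):
    """Single-head causal attention: the i-th query attends only to the first i keys.
    Q, K, V: (L, d) tensors; shapes are illustrative, not from the paper."""
    L = Q.shape[0]
    scores = Q @ K.T                                          # (L, L): entry (i, j) = q_i . k_j
    causal_mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal_mask, float("-inf"))   # query i cannot see keys j > i
    weights = torch.softmax(scores, dim=-1)                   # softmax(q_i K^T), row by row
    return weights @ V                                        # scores multiplied by the values
```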

ALiBi adds a static, non-learned bias after the query-key dot product:

$$\text{softmax}(q_i K^{T} + m \cdot [-(i-1), \ldots, -2, -1, 0]),$$

where $m$ is a head-specific slope fixed before training.
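A minimal sketch of the same step with the ALiBi bias added, again in PyTorch. The geometric slope schedule (starting at $2^{-8/n}$ for $n$ heads) is the one the paper recommends; everything else (names, shapes) is illustrative:

```python
import torch

def alibi_bias(L, num_heads):
    """Static, non-learned bias of shape (num_heads, L, L).
    Slopes form a geometric sequence starting at 2^(-8/num_heads), as in the paper."""
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    positions = torch.arange(L)
    distance = positions[None, :] - positions[:, None]        # entry (i, j) = j - i
    # For keys j <= i this equals -(i - j), i.e. the penalty [-(i-1), ..., -2, -1, 0];
    # entries with j > i are discarded by the causal mask below.
    return slopes[:, None, None] * distance[None, :, :]       # m * distance, per head

def alibi_attention(Q, K, V, bias):
    """Multi-head causal attention with the bias added after the query-key dot product.
    Q, K, V: (num_heads, L, d); bias: (num_heads, L, L)."""
    L = Q.shape[1]
    scores = Q @ K.transpose(-2, -1) + bias                    # q_i K^T + m * [-(i-1), ..., 0]
    causal_mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal_mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ V
```

With 8 heads, for example, this schedule gives slopes 1/2, 1/4, ..., 1/256.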
