Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation | Paper Notes
8/27/2025
https://arxiv.org/abs/2108.12409
Introduces Attention with Linear Biases (ALiBi), a technique that aims to improve a language model's ability to extrapolate to sequences longer than those it was trained on.
ALiBi negatively biases attention scores with a linearly decreasing penalty proportional to the distance between the relevant key and query. Their approach "eliminates position embeddings."
Training on shorter sequences costs less. Their 1.3B-parameter model trained on 1024-token sequences with ALiBi achieves the same perplexity as a sinusoidal position embedding model trained on 2048-token sequences when both are evaluated on sequences of 2048 tokens, while training 11% faster and using 11% less memory.
In the classic approach, position embeddings are added to the word embeddings at the bottom of the network. For an input subsequence of length $L$, the attention sublayer computes the attention scores of the $i$-th query $\mathbf{q}_i \in \mathbb{R}^{1 \times d}$ in each head, given the first $i$ keys $\mathbf{K} \in \mathbb{R}^{i \times d}$, where $d$ is the head dimension:

$$\text{softmax}(\mathbf{q}_i \mathbf{K}^\top)$$
These scores are then multiplied by the values.
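A minimal NumPy sketch of this unmodified attention step for a single head; the function names `softmax` and `attention_output` are mine, and the $1/\sqrt{d}$ scaling is omitted to mirror the simplified formula above:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention_output(q_i, K, V):
    # q_i: (1, d) query, K: (i, d) keys, V: (i, d) values.
    scores = softmax(q_i @ K.T)   # (1, i) attention weights
    return scores @ V             # (1, d) weighted sum of the values
```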
ALiBi adds a static, non-learned bias after the query-key dot product:

$$\text{softmax}(\mathbf{q}_i \mathbf{K}^\top + m \cdot [-(i-1), \dots, -2, -1, 0])$$

where $m$ is a head-specific slope fixed before training.
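A sketch of the same step with the ALiBi bias added before the softmax, reusing the helpers above; `alibi_attention` and its argument names are illustrative, not the paper's code:

```python
def alibi_attention(q_i, K, V, m):
    # q_i: (1, d) query, K/V: (i, d) keys and values, m: head-specific slope.
    i = K.shape[0]
    distances = np.arange(-(i - 1), 1)            # [-(i-1), ..., -2, -1, 0]
    scores = softmax(q_i @ K.T + m * distances)   # linear bias before softmax
    return scores @ V
```

In the paper the slopes are not learned: for $n$ heads they form a geometric sequence (e.g. 1/2, 1/4, ..., 1/256 for 8 heads), so each head penalizes distance at a different rate.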