Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention | Paper Notes
8/29/2025
https://arxiv.org/abs/2404.07143
Infini-attention is an attention mechanism designed to let Transformer models process infinitely long sequences of text while using a bounded amount of memory. It combines two forms of attention within a single Transformer block: standard local attention for short-term context and a compressive memory for long-term context.
Input sequences are processed in segments (chunks). For each segment, the model computes both local attention and a read from the compressive memory. Local attention is standard scaled dot-product attention restricted to the current segment. Instead of discarding the key-value (KV) states from past segments, the model compresses them into a fixed-size "compressive memory" matrix. When processing a new segment, the new query (Q) vectors retrieve relevant information from this compressed summary of the entire history.
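A minimal sketch of the segment-wise flow, assuming a (seq_len, d_model) input and hypothetical helper names (the individual steps are filled in below):

```python
import torch

def split_into_segments(x, segment_len):
    """Split a (seq_len, d_model) sequence into fixed-length segments."""
    return torch.split(x, segment_len, dim=0)

# For each segment the model computes:
#   1. A_dot: local scaled dot-product attention within the segment
#   2. A_mem: a read from the compressive memory of all past segments
#   3. an update of the compressive memory with the segment's K/V states
#   4. a gated combination of A_mem and A_dot (sketched further below)
```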
The memory matrix is an associative matrix that stores compressed information from past segments. It is a two-dimensional matrix, with a separate copy for each attention head in each layer. Its dimensions are $d_{\text{key}} \times d_{\text{value}}$, so its size does not grow with sequence length.
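As a concrete illustration of the memory state held per head (a sketch; the dimension values are arbitrary examples, not the paper's configuration):

```python
import torch

d_key, d_value = 128, 128   # per-head key/value dimensions (example values)

# One associative memory matrix and one normalization vector per head, per layer.
M = torch.zeros(d_key, d_value)   # compressive memory: d_key x d_value
z = torch.zeros(d_key)            # normalization term z, one entry per key dimension
```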
The model combines local and long-term attention with a learned gating mechanism: a single learnable parameter per attention head decides how much to weight local versus historical information. After processing each segment, the compressive memory is updated with that segment's key and value states.
Content is retrieved from the compressive memory using the current segment's query states; the normalization term keeps the retrieval numerically stable:

$$A_{\text{mem}} = \frac{\sigma(Q)\, M_{s-1}}{\sigma(Q)\, z_{s-1}}$$

Here $A_{\text{mem}}$ is the retrieved content, $Q$ the queries, $M_{s-1}$ the memory state from the previous segment, $z_{s-1}$ the normalization term, and $\sigma$ a nonlinear activation function (ELU + 1).
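A sketch of the retrieval step in PyTorch for a single head (shapes assumed from the notation above; the `eps` stabilizer is my addition, not from the paper):

```python
import torch
import torch.nn.functional as F

def elu_plus_one(x):
    # Nonlinearity sigma applied to queries and keys, as in linear attention.
    return F.elu(x) + 1.0

def retrieve(Q, M_prev, z_prev, eps=1e-6):
    """Read A_mem from the compressive memory.

    Q:      (N, d_key)        queries of the current segment
    M_prev: (d_key, d_value)  memory state from the previous segment
    z_prev: (d_key,)          normalization term
    Returns A_mem with shape (N, d_value).
    """
    sigma_q = elu_plus_one(Q)                        # (N, d_key)
    numer = sigma_q @ M_prev                         # (N, d_value)
    denom = (sigma_q @ z_prev).unsqueeze(-1) + eps   # (N, 1), added for numerical safety
    return numer / denom
```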
The memory is then updated with the key-value pairs from the current segment by adding the outer product of the transformed keys and the values to the existing memory matrix:

$$M_s \leftarrow M_{s-1} + \sigma(K)^{\top} V$$

The normalization term is updated as:

$$z_s \leftarrow z_{s-1} + \sum_{t=1}^{N} \sigma(K_t)$$
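Continuing the sketch (reusing `elu_plus_one` from above):

```python
def update_memory(K, V, M_prev, z_prev):
    """Fold the current segment's KV states into the memory (linear update).

    K: (N, d_key), V: (N, d_value)
    """
    sigma_k = elu_plus_one(K)              # (N, d_key)
    M_new = M_prev + sigma_k.T @ V         # outer-product accumulation: (d_key, d_value)
    z_new = z_prev + sigma_k.sum(dim=0)    # (d_key,)
    return M_new, z_new
```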
An alternative update rule (inspired by the delta rule) first subtracts the value the memory currently retrieves for each key before adding the new value, leaving the memory unmodified if the key-value binding already exists:

$$M_s \leftarrow M_{s-1} + \sigma(K)^{\top}\left(V - \frac{\sigma(K)\, M_{s-1}}{\sigma(K)\, z_{s-1}}\right)$$

Here $M_s$ is the new memory state, $M_{s-1}$ the previous state, and $K$ and $V$ the keys and values of the current segment.
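The delta-rule variant, sketched the same way (again with an added `eps` stabilizer):

```python
def update_memory_delta(K, V, M_prev, z_prev, eps=1e-6):
    """Delta-rule update: subtract what the memory already returns for each key
    before writing, so existing key-value bindings are not re-added.
    """
    sigma_k = elu_plus_one(K)                                     # (N, d_key)
    retrieved = (sigma_k @ M_prev) / ((sigma_k @ z_prev).unsqueeze(-1) + eps)
    M_new = M_prev + sigma_k.T @ (V - retrieved)                  # (d_key, d_value)
    z_new = z_prev + sigma_k.sum(dim=0)                           # (d_key,)
    return M_new, z_new
```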
Long-term context injection combines the information retrieved from the compressive memory with the local attention context via a learned gating scalar $\beta$. This gate determines the balance between long-term and short-term context, and the final attention output is a weighted sum of the two components:

$$A = \operatorname{sigmoid}(\beta) \odot A_{\text{mem}} + \left(1 - \operatorname{sigmoid}(\beta)\right) \odot A_{\text{dot}}$$

Where:
- $A$ is the final aggregated attention context.
- $A_{\text{mem}}$ is the content retrieved from the long-term compressive memory.
- $A_{\text{dot}}$ is the local attention state from standard scaled dot-product attention.
- $\beta$ is a single learnable scalar parameter per head.
- $\operatorname{sigmoid}(\beta)$ squashes $\beta$ to a value between $0$ and $1$, acting as a soft gate.
- $\odot$ denotes element-wise multiplication.
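A sketch of the gated combination (one scalar gate per head; `beta` would be a learnable parameter such as a `torch.nn.Parameter`):

```python
import torch

def inject_long_term_context(A_mem, A_dot, beta):
    """Mix memory-retrieved context with local attention via a learned gate.

    A_mem, A_dot: (N, d_value) contexts for one head
    beta:         scalar gate parameter for that head (a tensor)
    """
    gate = torch.sigmoid(beta)                  # soft gate in (0, 1)
    return gate * A_mem + (1.0 - gate) * A_dot
```

Because the gate is a single scalar per head, each head can learn to lean toward either local attention or the compressive memory.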