Blockwise Parallel Transformer for Large Context Models | Paper Notes
These are my notes and thoughts, jotted down for future reference. They may be outdated, inaccurate, or completely useless.
10/17/2025
https://arxiv.org/abs/2305.19370
Introduces the Blockwise Parallel Transformer (BPT), which leverages blockwise computation of self-attention and feedforward network (FFN) fusion to minimize memory costs.
It fuses blockwise self-attention with the FFN computation. In a standard Transformer these two operations run sequentially over the whole sequence; BPT instead computes self-attention for each block of the input sequence and immediately applies the FFN to that block's result, which avoids storing intermediate outputs (activations) for the entire sequence.
Compared to FlashAttention, which tiles only the attention computation and still produces the full-sequence attention output before the FFN runs, BPT also folds the FFN into the blockwise loop, so the FFN never needs the output of the entire sequence at once.
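A minimal NumPy sketch of the idea (my own simplification, not the paper's JAX implementation): single head, no residual/layernorm, a ReLU FFN, and hypothetical names like `blockwise_attention_ffn`, `block_q`, `block_kv`, `w1`, `w2`. It streams over key/value blocks with an online softmax and applies the FFN to each query block's attention output right away.

```python
import numpy as np

def blockwise_attention_ffn(q, k, v, w1, w2, block_q=128, block_kv=128):
    """Sketch: blockwise self-attention fused with the FFN (assumed shapes:
    q, k, v are (n, d); w1 is (d, d_ff); w2 is (d_ff, d))."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.empty((n, w2.shape[1]))

    for qs in range(0, n, block_q):
        qb = q[qs:qs + block_q]                     # current query block (bq, d)
        acc = np.zeros((qb.shape[0], d))            # running weighted sum of V
        row_max = np.full(qb.shape[0], -np.inf)     # running max for stable softmax
        row_sum = np.zeros(qb.shape[0])             # running softmax denominator

        # Stream over key/value blocks with an online (rescaled) softmax.
        for ks in range(0, n, block_kv):
            kb = k[ks:ks + block_kv]
            vb = v[ks:ks + block_kv]
            scores = qb @ kb.T * scale              # (bq, bkv)

            new_max = np.maximum(row_max, scores.max(axis=1))
            correction = np.exp(row_max - new_max)  # rescale previous partial sums
            p = np.exp(scores - new_max[:, None])

            acc = acc * correction[:, None] + p @ vb
            row_sum = row_sum * correction + p.sum(axis=1)
            row_max = new_max

        attn_block = acc / row_sum[:, None]         # attention output for this block only

        # FFN applied immediately to the block; the full-sequence
        # attention output is never materialized.
        hidden = np.maximum(attn_block @ w1, 0.0)   # ReLU FFN
        out[qs:qs + block_q] = hidden @ w2

    return out
```

The inner KV loop is the same streaming trick FlashAttention uses; the difference illustrated here is the last two lines of the outer loop, where the FFN consumes each block's attention output directly instead of waiting for the whole sequence.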