Tin Rabzelj
Blog
These are my notes and thoughts, jotted down for future reference. They may be outdated, inaccurate, or completely useless.
- Fast Inference from Transformers via Speculative Decoding (AI, Paper Notes)
- Fast Transformer Decoding: One Write-Head is All You Need (AI, Paper Notes)
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (AI, Paper Notes)
- Ring Attention with Blockwise Transformers for Near-Infinite Context (AI, Paper Notes)
- Effective Long-Context Scaling of Foundation Models (AI, Paper Notes)
- YaRN: Efficient Context Window Extension of Large Language Models (AI, Paper Notes)
- Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention (AI, Paper Notes)
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (AI, Paper Notes)
- Longformer: The Long-Document Transformer (AI, Paper Notes)
- ReAct: Synergizing Reasoning and Acting in Language Models (AI, Paper Notes)
- RoFormer: Enhanced Transformer with Rotary Position Embedding (AI, Paper Notes)
- The Impact of Positional Encoding on Length Generalization in Transformers (AI, Paper Notes)
- Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation (AI, Paper Notes)
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (AI, Paper Notes)
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models (AI, Paper Notes)
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (AI, Paper Notes)
- Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (AI, Paper Notes)
- Language Models are Few-Shot Learners (AI, Paper Notes)
- Language Models are Unsupervised Multitask Learners (AI, Paper Notes)
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (AI, Paper Notes)
- Improving Language Understanding by Generative Pre-Training (AI, Paper Notes)
- Attention Is All You Need (AI, Paper Notes, Python)
- Neural Machine Translation by Jointly Learning to Align and Translate (AI, Paper Notes, Python)
- Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference (AI, Paper Notes)
- Sequence to Sequence Learning with Neural Networks (AI, Paper Notes, Python)
- GloVe: Global Vectors for Word Representation (AI, Paper Notes, Python)
- Efficient Estimation of Word Representations in Vector Space (AI, Paper Notes, Python)