Attention is arguably the most important concept in modern LLMs. Once you understand it, most of the transformer falls into place. In the previous lesson, we turned tokens into vectors. Now we need those vectors to talk to each other.
The Problem with Fixed Context
In our bigram model, each token only saw one token of context. That is why it generated gibberish. We need a way for each token to “look at” every other token in the sequence and decide what is relevant.
That is exactly what attention does. Consider: “The dragon guarded the treasure because it was valuable.” What does “it” refer to? Treasure, not dragon. Attention is the mechanism that lets the model figure this out. The seminal paper “Attention Is All You Need” (Vaswani et al., 2017) introduced the Transformer, an architecture built around attention, and demonstrated that a model relying solely on attention mechanisms could outperform recurrent and convolutional approaches on machine translation.
How Self-Attention Works
Each token produces three vectors:
- Query (Q): “What am I looking for?”
- Key (K): “What information do I contain?”
- Value (V): “What do I pass on if I am relevant?”
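As a rough sketch of where these three vectors come from: each token embedding is multiplied by three weight matrices. The names W_q, W_k and W_v, the sizes, and the random weights below are assumptions for illustration; a trained model learns these matrices.

```python
import random

def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def random_matrix(rows, cols, rng):
    """Small random weights; a trained model would learn these instead."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, d_k = 4, 3  # illustrative sizes, not taken from the lesson

# Hypothetical projection matrices, shared by every token in the sequence.
W_q = random_matrix(d_k, d_model, rng)
W_k = random_matrix(d_k, d_model, rng)
W_v = random_matrix(d_k, d_model, rng)

# Two made-up token embeddings standing in for the vectors from the previous lesson.
tokens = [[1.0, 0.0, 0.5, -0.5], [0.2, 0.9, -0.3, 0.1]]

queries = [matvec(W_q, x) for x in tokens]  # "What am I looking for?"
keys = [matvec(W_k, x) for x in tokens]     # "What information do I contain?"
values = [matvec(W_v, x) for x in tokens]   # "What do I pass on if I am relevant?"

print(len(queries), "query vectors of size", len(queries[0]))
```

Every token gets its own query, key and value, but all tokens share the same three matrices.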
Then we compare every query against every key to get attention scores, scale them by 1/sqrt(d_k) to keep the softmax from saturating, turn them into probabilities with softmax, and compute a weighted sum of the values. In matrix form:
Attention(Q, K, V) = softmax(Q @ K.T / sqrt(d_k)) @ V
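Here is a minimal sketch of that formula in plain Python, with lists standing in for matrices. The function names and the toy Q, K and V values are made up for illustration; a real implementation would use a tensor library.

```python
import math

def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    weights, output = [], []
    for q in Q:
        # Compare this query against every key, then scale by 1/sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        row = softmax(scores)
        weights.append(row)
        # Weighted sum of the value vectors.
        output.append([sum(w * v[i] for w, v in zip(row, V)) for i in range(len(V[0]))])
    return output, weights

# Toy inputs: token 0's query matches key 0, token 1's query matches key 1.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]

output, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights sums to 1, and each token puts the most weight on the key that best matches its query.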
The scaling factor is critical: without it, the dot products grow large in magnitude as d_k increases, pushing the softmax into regions with extremely small gradients. This scaled dot-product attention is the standard building block in most transformer models today.
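To get a feel for the effect, the sketch below draws random query and key vectors with unit-variance entries (an assumption made for this experiment) and measures how spread out their dot products are. The choice of d_k = 64 and the sample count are arbitrary.

```python
import math
import random

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
d_k = 64  # arbitrary illustrative size

def random_vector():
    """A vector whose entries have mean 0 and variance 1."""
    return [rng.gauss(0.0, 1.0) for _ in range(d_k)]

# Dot products of many random query/key pairs.
dots = [sum(a * b for a, b in zip(random_vector(), random_vector())) for _ in range(2000)]
mean = sum(dots) / len(dots)
variance = sum((d - mean) ** 2 for d in dots) / len(dots)

# Dividing each score by sqrt(d_k) divides the variance by d_k.
scaled_variance = variance / d_k

# Compare how peaked the softmax is with and without scaling.
raw_scores = dots[:8]
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]
peak_raw = max(softmax(raw_scores))
peak_scaled = max(softmax(scaled_scores))

print("variance of raw dot products:", round(variance, 1))
print("variance after scaling:", round(scaled_variance, 2))
print("largest softmax weight, raw vs scaled:", round(peak_raw, 3), round(peak_scaled, 3))
```

The raw variance comes out close to d_k and the scaled variance close to 1. The unscaled softmax is more sharply peaked, which is the saturated regime where gradients become small.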
To prevent tokens from “cheating” by looking at future tokens during training, we apply a causal mask: we set future attention scores to -infinity before softmax, so their weights come out as exactly zero. This is why when you run attention.py, you see a lower-triangular weight matrix where each token only attends to itself and previous tokens. This masking is what enables autoregressive generation.
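The masking step can be sketched on its own. This is not the lesson's attention.py, only a small stand-alone illustration; the function name causal_weights and the all-zero score matrix are made up for the example.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # math.exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_weights(scores):
    """Mask future positions (column j > row i) with -inf, then softmax each row."""
    n = len(scores)
    masked = [
        [scores[i][j] if j <= i else -math.inf for j in range(n)]
        for i in range(n)
    ]
    return [softmax(row) for row in masked]

# Made-up scores: every token scores every other token equally.
scores = [[0.0] * 3 for _ in range(3)]
weights = causal_weights(scores)

for row in weights:
    print([round(w, 3) for w in row])
```

Because the scores here are all equal, token i spreads its weight evenly over positions 0 through i, and everything above the diagonal is zero.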
Multi-Head Attention
One attention head is good. Multiple heads are better. Each head can learn to look for different relationships: one head might track noun-adjective relationships, another might track subject-verb agreement, and a third might resolve pronouns. The outputs are concatenated and projected back to the original dimension. The Transformer base model used 8 heads; the largest GPT-3 model uses 96 heads in each of its 96 layers.
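A simplified sketch of the split, attend, concatenate and project steps follows. It makes several assumptions for brevity: the input vectors are reused directly as Q, K and V (a real model would first apply learned projections), the causal mask is left out, and the output projection W_o is set to the identity so the result is easy to inspect.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention; returns one output vector per query."""
    d_k = len(K[0])
    output = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        row = softmax(scores)
        output.append([sum(w * v[i] for w, v in zip(row, V)) for i in range(len(V[0]))])
    return output

def split_heads(X, n_heads):
    """Slice each vector into n_heads equal chunks: one list of rows per head."""
    d_head = len(X[0]) // n_heads
    return [[row[h * d_head:(h + 1) * d_head] for row in X] for h in range(n_heads)]

def multi_head_attention(Q, K, V, n_heads, W_o):
    heads = zip(split_heads(Q, n_heads), split_heads(K, n_heads), split_heads(V, n_heads))
    head_outputs = [attention(q, k, v) for q, k, v in heads]
    # Concatenate the per-head outputs for each token.
    concatenated = [
        [x for head in head_outputs for x in head[t]] for t in range(len(Q))
    ]
    # Project back to the model dimension with W_o.
    return [[sum(w * x for w, x in zip(w_row, row)) for w_row in W_o] for row in concatenated]

# Three made-up token vectors with d_model = 4, split across 2 heads.
X = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, -0.5, 0.5], [1.0, 1.0, 0.0, -1.0]]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

out = multi_head_attention(X, X, X, n_heads=2, W_o=identity)
print(len(out), "tokens, each of size", len(out[0]))
```

Each head only sees its own slice of the vectors, so the two heads here compute different attention weights over the same three tokens.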
Self-attention has a key limitation: its cost is quadratic in the sequence length (O(n^2)), since every token attends to every other token. This is one reason models have context length limits. Approaches such as Longformer and the sliding window attention used in Mistral address this with sparse attention patterns, where each token attends to only a subset of positions.
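The difference in cost can be illustrated by counting how many query-key pairs need a score. The function names and the window size below are hypothetical, and the sliding window here is a bare-bones version: it does not model Longformer's global tokens or any specifics of Mistral's implementation.

```python
def full_pairs(n):
    """Every token attends to every token: n * n scores."""
    return n * n

def causal_pairs(n):
    """Token i attends to positions 0..i: n * (n + 1) / 2 scores."""
    return n * (n + 1) // 2

def sliding_window_pairs(n, window):
    """Token i attends to at most `window` positions ending at i."""
    return sum(min(i + 1, window) for i in range(n))

window = 256  # hypothetical window size, chosen for illustration
for n in (1024, 2048, 4096):
    print(n, full_pairs(n), causal_pairs(n), sliding_window_pairs(n, window))
```

Doubling n roughly quadruples the full and causal counts, while the sliding window count only about doubles, since each token scores at most a fixed number of positions.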
In the next lesson, we will assemble everything into a complete transformer – combining embeddings, attention, and feed-forward layers into a working model.
