We have built tokenization, embeddings, and attention separately. Now let us assemble them into a complete decoder-only transformer with the same overall structure as GPT.
Transformer Model Architecture
A decoder-only transformer is a stack of identical blocks, each containing:
- Self-attention – every token looks at every previous token via causal masking
- Feed-forward network – a small neural net processes each token independently with a ReLU or GELU activation
- Residual connections – shortcut paths so gradients flow easily during training, formulated as output = x + sublayer(x)
- Layer normalization – keeps activations in a healthy range by normalizing across the feature dimension
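The four components above can be sketched as a single block in NumPy. This is a minimal illustration under several assumptions, not the lesson's transformer.py: it uses one attention head, ReLU, a pre-norm layout (normalizing before each sublayer), and layer normalization without learned scale and shift. The names decoder_block and init_block are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's feature vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract the row max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def causal_self_attention(x, Wq, Wk, Wv, Wo):
    seq_len, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    # Causal mask: token i may only attend to tokens j <= i.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v @ Wo

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise MLP with ReLU, applied to each token independently.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def decoder_block(x, p):
    # Assumed pre-norm residual form: output = x + sublayer(layer_norm(x)).
    x = x + causal_self_attention(layer_norm(x), p["Wq"], p["Wk"], p["Wv"], p["Wo"])
    x = x + feed_forward(layer_norm(x), p["W1"], p["b1"], p["W2"], p["b2"])
    return x

def init_block(d, rng, scale=0.02):
    # Hypothetical helper: small random weights, hidden size 4 * d.
    h = 4 * d
    return {
        "Wq": rng.normal(0, scale, (d, d)), "Wk": rng.normal(0, scale, (d, d)),
        "Wv": rng.normal(0, scale, (d, d)), "Wo": rng.normal(0, scale, (d, d)),
        "W1": rng.normal(0, scale, (d, h)), "b1": np.zeros(h),
        "W2": rng.normal(0, scale, (h, d)), "b2": np.zeros(d),
    }

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 32))          # 6 tokens, 32 features each
y = decoder_block(x, init_block(32, rng))
print(y.shape)                        # same shape as the input: (6, 32)
```

Because of the causal mask, changing the last token in x leaves the outputs at all earlier positions unchanged, which is a quick way to check that the mask is applied correctly.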
Unlike the original Transformer, which had both an encoder and a decoder, most modern LLMs use decoder-only architectures. As the HuggingFace LLM course explains, causal language modeling (predicting the next token left-to-right) proved to be the most scalable training objective. GPT-3, LLaMA, Claude, Gemini, and Mistral are all generally described as following this design.
Transformer Model Shape Walkthrough
Every transformer layer preserves the shape (seq_len, embed_dim). Only the final projection changes it to (seq_len, vocab_size). Run transformer.py to see this in action with a 24-token vocabulary and 32-dimensional embeddings.
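The shape bookkeeping can be traced with a short NumPy sketch. It reuses the 24-token vocabulary and 32-dimensional embeddings mentioned above, but the sequence length, layer count, and toy_block function are assumptions made for illustration; toy_block is only a stand-in for a real decoder block and is not what transformer.py implements.

```python
import numpy as np

vocab_size, embed_dim = 24, 32      # sizes taken from the lesson
seq_len, n_layers = 10, 4           # assumed values for this sketch
rng = np.random.default_rng(0)

token_ids = rng.integers(0, vocab_size, size=seq_len)         # (seq_len,)
embedding = rng.normal(0, 0.02, size=(vocab_size, embed_dim))
lm_head = rng.normal(0, 0.02, size=(embed_dim, vocab_size))

def toy_block(x, W):
    # Stand-in for a decoder block: any residual update keeps the shape.
    return x + np.tanh(x @ W)

x = embedding[token_ids]            # lookup gives (seq_len, embed_dim)
shapes = [x.shape]
for _ in range(n_layers):
    W = rng.normal(0, 0.02, size=(embed_dim, embed_dim))
    x = toy_block(x, W)
    shapes.append(x.shape)          # still (seq_len, embed_dim)

logits = x @ lm_head                # final projection: (seq_len, vocab_size)
print(shapes, logits.shape)
```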
This constant-shape pathway is often called the “residual stream”: information flows through the model, and each block reads from and writes to it. Without residual connections, a 100-layer transformer would be very hard to train, because the gradients would tend to vanish or explode. With residuals, each layer only computes a “delta”, a correction that is added to its input. He et al. (2016) originally demonstrated this principle with ResNets in computer vision, and the Transformer architecture adopted it.
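A toy calculation can make the gradient argument concrete. The sketch below assumes purely linear sublayers with small random weights, which is far simpler than a real transformer. For such layers, the gradient of the output with respect to the input is the product of the per-layer Jacobians, so we can compare a plain stack (y = W x) with a residual stack (y = x + W x).

```python
import numpy as np

d, n_layers = 32, 100
rng = np.random.default_rng(0)

J_plain = np.eye(d)   # Jacobian of a plain stack: product of W matrices
J_resid = np.eye(d)   # Jacobian of a residual stack: product of (I + W)
for _ in range(n_layers):
    W = rng.normal(0, 0.1 / np.sqrt(d), size=(d, d))  # assumed small weights
    J_plain = W @ J_plain
    J_resid = (np.eye(d) + W) @ J_resid

identity_norm = np.linalg.norm(np.eye(d))
plain_norm = np.linalg.norm(J_plain)    # collapses toward zero
resid_norm = np.linalg.norm(J_resid)    # stays on the order of the identity
print(plain_norm, resid_norm, identity_norm)
```

In this setup the plain product shrinks by roughly the weight scale at every layer, while the residual product stays close to the identity because each layer contributes only a small correction. With large weights the plain stack would explode instead of vanishing.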
Positional Encoding
Self-attention on its own has no notion of token order: permuting the input tokens simply permutes the outputs in the same way (strictly speaking, it is permutation-equivariant). We therefore need to inject position information. The original Transformer used fixed sinusoidal encodings. Modern models like LLaMA use Rotary Position Embeddings (RoPE), which encode position by rotating queries and keys so that attention scores depend on the relative distance between tokens. This can also help with sequences longer than those seen in training, although reliable extrapolation usually requires additional techniques.
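The relative-position property of RoPE can be checked numerically. The sketch below is a simplified single-vector version that rotates adjacent pairs of dimensions and assumes the commonly used base of 10000; real implementations are batched and may pair dimensions differently. The function name rope is hypothetical.

```python
import numpy as np

def rope(x, position, base=10000.0):
    # Rotate each adjacent pair (x[2i], x[2i+1]) by the angle position * theta_i,
    # where theta_i = base ** (-2i / d).
    d = x.shape[-1]
    assert d % 2 == 0, "RoPE needs an even number of dimensions"
    theta = base ** (-np.arange(0, d, 2) / d)       # (d/2,) frequencies
    angles = position * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=32), rng.normal(size=32)

# Both pairs of positions are 2 apart, so the attention scores should match.
score_near = rope(q, 5) @ rope(k, 3)
score_far = rope(q, 12) @ rope(k, 10)
print(score_near, score_far)
```

The two scores agree up to floating-point error because the dot product of two rotated vectors depends only on the difference between their rotation angles, that is, on the distance between the two positions.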
The Language Model Head
After the decoder blocks, a final linear layer (often called the “language model head” or lm_head) projects the hidden states into logits over the vocabulary. During training, the logits at each position are scored against the token that follows it: we drop the logits for the last position, drop the first token from the targets, and compute cross-entropy loss between the two. This is how the model learns to predict the next token at every position simultaneously.
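The next-token loss can be sketched in a few lines of NumPy. This is an illustrative version for a single unbatched sequence; the function name next_token_loss and the random inputs are assumptions, and a PyTorch training loop would typically use its built-in cross-entropy function instead.

```python
import numpy as np

def next_token_loss(logits, token_ids):
    # logits: (seq_len, vocab_size), token_ids: (seq_len,)
    preds = logits[:-1]        # predictions made at positions 0 .. seq_len-2
    targets = token_ids[1:]    # the tokens that actually came next
    # Numerically stable log-softmax over the vocabulary.
    preds = preds - preds.max(axis=-1, keepdims=True)
    log_probs = preds - np.log(np.exp(preds).sum(axis=-1, keepdims=True))
    # Average negative log-probability assigned to each correct next token.
    return -log_probs[np.arange(len(targets)), targets].mean()

vocab_size, embed_dim, seq_len = 24, 32, 10
rng = np.random.default_rng(0)
hidden = rng.normal(size=(seq_len, embed_dim))      # output of the last block
lm_head = rng.normal(0, 0.02, size=(embed_dim, vocab_size))
token_ids = rng.integers(0, vocab_size, size=seq_len)

logits = hidden @ lm_head                           # (seq_len, vocab_size)
loss = next_token_loss(logits, token_ids)
print(loss)
```

A useful sanity check is that a model with uniform predictions (all logits equal) has a loss of log(vocab_size), which is about 3.18 for a 24-token vocabulary. An untrained model should start near that value.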
In the next lesson, we will train this transformer using PyTorch and watch it learn to generate fairy tales.
