Building a transformer model from scratch

We have built tokenization, embeddings, and attention separately. Now let us assemble them into a complete decoder-only transformer with the same overall structure as GPT-style models.

Transformer Model Architecture

A decoder-only transformer is a stack of identical blocks, each containing:

  • Self-attention – every token attends to itself and to every earlier token; a causal mask hides the later ones
  • Feed-forward network – a small neural net processes each token independently with a ReLU or GELU activation
  • Residual connections – shortcut paths so gradients flow easily during training, formulated as output = x + sublayer(x)
  • Layer normalization – keeps activations in a healthy range by normalizing across the feature dimension
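To make the four ingredients above concrete, here is a minimal NumPy sketch of a single decoder block. It is not the code from this series: the function names, the single attention head, the pre-norm ordering (normalizing before each sublayer), and the layer norm without learned scale and bias are all simplifying assumptions chosen to keep the example short.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's feature vector to zero mean and unit variance.
    # (Assumption: no learned scale/bias, to keep the sketch short.)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def gelu(x):
    # Widely used tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def causal_self_attention(x, w_q, w_k, w_v, w_o):
    # Single-head attention over x with shape (seq_len, embed_dim).
    seq_len, embed_dim = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / np.sqrt(embed_dim)
    # Causal mask: position i may only attend to positions j <= i.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax over each row.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return (weights @ v) @ w_o, weights

def decoder_block(x, params):
    # Residual connection around attention: output = x + sublayer(x).
    attn_out, _ = causal_self_attention(layer_norm(x), *params["attn"])
    x = x + attn_out
    # Residual connection around the position-wise feed-forward network.
    w1, w2 = params["ffn"]
    x = x + gelu(layer_norm(x) @ w1) @ w2
    return x

rng = np.random.default_rng(0)
seq_len, embed_dim = 6, 32
params = {
    "attn": [rng.normal(0.0, 0.1, (embed_dim, embed_dim)) for _ in range(4)],
    "ffn": (
        rng.normal(0.0, 0.1, (embed_dim, 4 * embed_dim)),
        rng.normal(0.0, 0.1, (4 * embed_dim, embed_dim)),
    ),
}
x = rng.normal(size=(seq_len, embed_dim))
y = decoder_block(x, params)
print(y.shape)  # (6, 32): same shape in, same shape out
```

Because of the causal mask, changing the last token of `x` leaves the outputs for all earlier positions unchanged, which is exactly the property next-token prediction relies on.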

Unlike the original Transformer, which had both an encoder and a decoder, most modern LLMs use a decoder-only architecture. As the Hugging Face LLM course explains, causal language modeling (predicting the next token left to right) has proved to be a highly scalable training objective. GPT-3, LLaMA, and Mistral all use this design, and closed models such as Claude and Gemini are generally understood to follow it as well.

Transformer Model Shape Walkthrough

Every transformer layer preserves the shape (seq_len, embed_dim). Only the final projection changes it to (seq_len, vocab_size). Run transformer.py to see this in action with a 24-token vocabulary and 32-dimensional embeddings.
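If you do not have transformer.py at hand, the shape bookkeeping can be sketched in a few lines of NumPy. This is an illustrative stand-in, not the lesson's script: `toy_block` is a hypothetical placeholder for a real decoder block, the weights are random, and the sequence length of 10 is an arbitrary choice. Only the 24-token vocabulary and 32-dimensional embeddings come from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, seq_len = 24, 32, 10  # seq_len is an arbitrary choice

token_ids = rng.integers(0, vocab_size, size=seq_len)      # (seq_len,)
embedding = rng.normal(0.0, 0.02, (vocab_size, embed_dim))
lm_head = rng.normal(0.0, 0.02, (embed_dim, vocab_size))

def toy_block(h, w):
    # Hypothetical stand-in for a decoder block: a residual update
    # that maps (seq_len, embed_dim) to (seq_len, embed_dim).
    return h + np.tanh(h @ w)

h = embedding[token_ids]                 # look up one vector per token
shapes = [("embeddings", h.shape)]
for i in range(3):
    w = rng.normal(0.0, 0.02, (embed_dim, embed_dim))
    h = toy_block(h, w)
    shapes.append((f"block {i}", h.shape))

logits = h @ lm_head                     # final projection to the vocabulary
shapes.append(("logits", logits.shape))

for name, shape in shapes:
    print(f"{name:12s} {shape}")
```

Every row printed before the last one shows (10, 32); only the final projection produces (10, 24), one score per vocabulary entry at each position.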

This (seq_len, embed_dim) tensor is often called the “residual stream”: information flows through the model, and each block reads from it and writes back to it. Without residual connections, a 100-layer transformer would be very hard to train, because gradients tend to vanish or explode as they pass through so many layers. With residuals, each layer only computes a “delta”, a correction that is added to its input. He et al. (2016) originally demonstrated this principle with ResNets in computer vision, and the Transformer architecture adopted it.
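The effect of the shortcut path on gradients can be illustrated with a deliberately simplified experiment. The sketch below assumes each "layer" is just a small random linear map (no attention, no nonlinearity), which is a toy setup of our own and not how a real transformer behaves in detail. For such a stack, the gradient of the output with respect to the input is the product of the per-layer Jacobians: W for a plain layer, and I + W for a residual layer.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, num_layers = 32, 100

# Toy assumption: every layer is a single small random linear map W_i.
weights = [rng.normal(0.0, 0.02, (embed_dim, embed_dim)) for _ in range(num_layers)]

identity = np.eye(embed_dim)
plain_jacobian = np.eye(embed_dim)      # for h <- h @ W_i
residual_jacobian = np.eye(embed_dim)   # for h <- h + h @ W_i
for w in weights:
    plain_jacobian = plain_jacobian @ w
    residual_jacobian = residual_jacobian @ (identity + w)

# Frobenius norm as a rough measure of how much gradient signal survives.
plain_norm = np.linalg.norm(plain_jacobian)
residual_norm = np.linalg.norm(residual_jacobian)
print(f"plain stack:    {plain_norm:.3e}")
print(f"residual stack: {residual_norm:.3e}")
```

In this toy setting the plain stack's Jacobian collapses to a vanishingly small number, while the residual stack's stays at an ordinary magnitude, because the identity term gives the gradient a direct path through all 100 layers.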

Positional Encoding

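Self-attention by itself is order-agnostic: apart from the causal mask, it treats its input as a set rather than a sequence (formally, it is permutation-equivariant), so we need to inject position information. The original Transformer used fixed sinusoidal encodings that are added to the token embeddings. Modern models like LLaMA use Rotary Position Embeddings (RoPE), which encode position by rotating queries and keys through angles proportional to their position. As a result, the attention score between two tokens depends on their relative offset rather than on their absolute positions. Extrapolating to sequences longer than those seen in training remains limited in practice and usually relies on additional scaling techniques.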
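Both schemes fit in a few lines of NumPy. The sketch below is illustrative only: the function names are our own, the base of 10000 follows the commonly used default, and the RoPE variant shown rotates consecutive pairs of dimensions (real implementations differ in how they pair dimensions). Both functions assume an even number of dimensions.

```python
import numpy as np

def sinusoidal_encoding(seq_len, embed_dim):
    # PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle).
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(0, embed_dim, 2)[None, :]     # (1, embed_dim / 2)
    angles = positions / (10000.0 ** (dims / embed_dim))
    pe = np.zeros((seq_len, embed_dim))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def rope(x, base=10000.0):
    # x: (seq_len, head_dim). Rotate each pair (x[2i], x[2i+1]) of the
    # vector at position pos by the angle pos * theta_i.
    seq_len, dim = x.shape
    theta = base ** (-np.arange(0, dim, 2) / dim)  # (dim / 2,)
    angles = np.arange(seq_len)[:, None] * theta[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

pe = sinusoidal_encoding(seq_len=10, embed_dim=32)
print(pe.shape)  # (10, 32), added to the token embeddings

# Relative-position check: use the SAME query and key content at every
# position, so any difference in scores comes from position alone.
rng = np.random.default_rng(0)
q = np.tile(rng.normal(size=(1, 8)), (6, 1))
k = np.tile(rng.normal(size=(1, 8)), (6, 1))
scores = rope(q) @ rope(k).T
# scores[i, j] depends only on i - j, so every diagonal is constant.
print(np.allclose(scores[:-1, :-1], scores[1:, 1:]))  # True
```

The final check is the key property of RoPE: shifting both the query and the key by one position leaves their attention score unchanged.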

The Language Model Head

After the decoder blocks, a final linear layer (often called the “language model head” or lm_head) projects the hidden states into logits over the vocabulary. During training, we offset the targets by one position relative to the logits: the logits at position t are scored with cross-entropy against the token at position t+1, so the last position's logits and the first token of the labels are dropped. This is how the model learns to predict the next token at every position simultaneously.
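The one-position offset between logits and targets is easy to get wrong, so here is a minimal NumPy sketch of the loss computation. The function name `next_token_loss` and the random stand-in for the decoder output are our own assumptions; a real training loop would use a framework's built-in cross-entropy instead.

```python
import numpy as np

def next_token_loss(logits, token_ids):
    # logits: (seq_len, vocab_size), token_ids: (seq_len,)
    pred = logits[:-1]        # predictions made at positions 0 .. T-2
    targets = token_ids[1:]   # the tokens that actually came next: 1 .. T-1
    # Numerically stable log-softmax over the vocabulary.
    pred = pred - pred.max(axis=-1, keepdims=True)
    log_probs = pred - np.log(np.exp(pred).sum(axis=-1, keepdims=True))
    # Mean negative log-likelihood of the correct next token.
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
vocab_size, embed_dim, seq_len = 24, 32, 10

hidden = rng.normal(size=(seq_len, embed_dim))   # stand-in for decoder output
lm_head = rng.normal(0.0, 0.02, (embed_dim, vocab_size))
token_ids = rng.integers(0, vocab_size, size=seq_len)

logits = hidden @ lm_head                        # (seq_len, vocab_size)
loss = next_token_loss(logits, token_ids)
print(logits.shape, float(loss))
```

A useful sanity check: if the logits are all equal, the model is guessing uniformly and the loss is exactly ln(24) ≈ 3.18. An untrained model with small random weights should start close to that value.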

In the next lesson, we will train this transformer using PyTorch and watch it learn to generate fairy tales.
