In the previous lesson, we split text into tokens. Now we turn those tokens into something a computer can compute with: word embeddings, dense vectors of numbers that capture meaning.
Why Word Embeddings Beat Raw IDs
A language model cannot do math on the word “dragon”. It needs numbers. But we cannot just assign arbitrary IDs – “dragon” = 1, “castle” = 2 – because then the model might think dragon and castle are related just because their IDs are close.
We need vectors (lists of numbers) that actually capture meaning. This idea was popularized by Mikolov et al. (2013) with word2vec, which showed that vector arithmetic can preserve semantic relationships: king - man + woman ≈ queen.
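To make the analogy arithmetic concrete, here is a minimal sketch using hand-picked 2-D toy vectors. The vectors, the `cosine` and `nearest` helpers, and the meaning assigned to each axis are all assumptions for illustration. Real word2vec vectors are learned from text and have hundreds of dimensions.

```python
import numpy as np

# Hand-picked toy vectors (assumption: axis 0 ~ "royalty", axis 1 ~ "gender").
# These are NOT real word2vec vectors; they only illustrate the arithmetic.
vectors = {
    "king":   np.array([0.9, 0.9]),
    "queen":  np.array([0.9, -0.9]),
    "man":    np.array([0.1, 0.9]),
    "woman":  np.array([0.1, -0.9]),
    "castle": np.array([0.5, 0.0]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(target, exclude):
    """Return the vocabulary word whose vector is most similar to target."""
    candidates = {w: v for w, v in vectors.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

# king - man + woman, then look up the closest remaining word.
target = vectors["king"] - vectors["man"] + vectors["woman"]
result = nearest(target, exclude={"king", "man", "woman"})
print(result)
```

The input words are excluded from the search, which is the usual convention when evaluating analogies. Without that step the nearest neighbor is often one of the inputs.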
One-Hot Vectors (The Wrong Way)
The naive approach is one-hot encoding: create a vector as long as your vocabulary, put a 1 in the position of the current word and 0 everywhere else. But here is the problem: every vector is equally “far” from every other vector. The cosine similarity between any two distinct one-hot vectors is 0, because they never share a non-zero position. As a result, the model has no way of knowing that “princess” and “knight” are both characters, while “castle” is a place.
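A quick check of that claim, as a sketch: the three-word vocabulary and the helper names `one_hot` and `cosine` are made up for this example.

```python
import numpy as np

vocab = ["princess", "knight", "castle"]  # toy vocabulary (assumption)

def one_hot(index, size):
    """Vector of zeros with a single 1 at the given position."""
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

vecs = {word: one_hot(i, len(vocab)) for i, word in enumerate(vocab)}

# Two characters vs. a character and a place: one-hot cannot tell them apart.
sim_princess_knight = cosine(vecs["princess"], vecs["knight"])
sim_princess_castle = cosine(vecs["princess"], vecs["castle"])
print(sim_princess_knight, sim_princess_castle)
```

Both similarities come out as 0.0, so the encoding carries no information about which words are related.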
Word2Vec and GloVe
Two landmark approaches shaped modern embeddings. Word2Vec (Mikolov et al., 2013) uses shallow neural networks to predict either a word from its context (CBOW) or the context from a word (skip-gram). GloVe (Pennington et al., 2014) takes a different route: it factorizes a global word co-occurrence matrix using weighted least squares, combining ideas from count-based and prediction-based methods. Both produce vectors where cosine similarity tends to reflect semantic similarity.
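To show what "predict the context from a word" means in practice, here is a minimal sketch of how skip-gram training pairs could be extracted from a token list. The function name `skipgram_pairs` and the `window` parameter are assumptions for illustration. This covers only the data preparation step, not the word2vec training itself.

```python
def skipgram_pairs(tokens, window=2):
    """Build (center, context) pairs from a sliding window over tokens."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the dragon lived in the castle".split()
pairs = skipgram_pairs(tokens, window=1)
for center, context in pairs:
    print(center, "->", context)
```

Skip-gram trains the network to predict the second element of each pair from the first. CBOW reverses the direction: it predicts the center word from its surrounding context words.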
Word Embeddings: The Right Way
An embedding is a dense vector (for example, 128 or 4096 numbers per token) where the numbers are learned. During training, the model adjusts these vectors so that words appearing in similar contexts get similar vectors. The code is deceptively simple:
    import numpy as np

    class EmbeddingLayer:
        def __init__(self, vocab_size, dim):
            # One row per token, initialized with small random values
            self.weights = np.random.randn(vocab_size, dim) * 0.1

        def __call__(self, token_ids):
            # Indexing with token IDs returns the matching rows
            return self.weights[token_ids]

Give it a token ID, get back a vector. The magic is not in the code – it is in the training process that shapes those vectors using backpropagation. In modern LLMs, embeddings are not fixed lookup tables: they participate in the gradient flow and are updated jointly with every other parameter.
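One way to see why the lookup takes part in training: indexing rows of the weight matrix gives the same result as multiplying one-hot vectors by that matrix, so gradients flow back into exactly the rows that were looked up. The sketch below assumes a toy vocabulary of 20 tokens, 4 dimensions, and a placeholder upstream gradient of all ones. None of these values come from a real model.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 20, 4                      # toy sizes (assumption)
weights = rng.standard_normal((vocab_size, dim)) * 0.1
token_ids = np.array([2, 5, 17])

# Forward pass: row lookup is equivalent to one-hot @ weights.
looked_up = weights[token_ids]               # shape (3, 4)
one_hot = np.eye(vocab_size)[token_ids]      # shape (3, 20)
via_matmul = one_hot @ weights
print(np.allclose(looked_up, via_matmul))

# Backward pass: scatter the upstream gradient into the rows that were used.
grad_out = np.ones_like(looked_up)           # placeholder upstream gradient
grad_weights = np.zeros_like(weights)
np.add.at(grad_weights, token_ids, grad_out) # accumulates if an ID repeats

# A plain gradient-descent step changes only the looked-up rows.
updated = weights - 0.1 * grad_weights
touched_rows = set(np.flatnonzero(grad_weights.any(axis=1)).tolist())
print(touched_rows)
```

In a real framework the upstream gradient comes from the loss through the rest of the network. The scatter-add step is the part specific to the embedding layer.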
Word Embedding Dimensionality
Choosing the embedding dimension is a critical design decision. GPT-2 Small uses 768 dimensions; GPT-3 uses 12,288. Higher dimensions can capture more nuanced relationships but require more parameters and computation. Work on scaling laws for neural language models suggests that model size, which grows with the embedding dimension, should increase together with dataset size for best performance. You can explore this trade-off by modifying EMBED_DIM in our embeddings.py script.
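To get a feel for the parameter cost, the embedding table alone holds vocab_size × dim numbers. The sketch below assumes a vocabulary of 50,257 tokens (the GPT-2 BPE vocabulary size, which this lesson does not state) and uses a hypothetical helper named `embedding_params`.

```python
def embedding_params(vocab_size, dim):
    """Number of parameters in an embedding table: one dim-sized row per token."""
    return vocab_size * dim

VOCAB_SIZE = 50257  # assumption: GPT-2 BPE vocabulary size

counts = {}
for name, dim in [("GPT-2 Small", 768), ("GPT-3", 12288)]:
    counts[name] = embedding_params(VOCAB_SIZE, dim)
    print(f"{name}: {counts[name]:,} embedding parameters")
```

Under that assumption the table grows from about 38.6 million to about 617.6 million parameters, before counting any attention or feed-forward weights.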
Word Embeddings in the Big Picture
Here is the data flow in every LLM:
"the dragon lived" → [2, 5, 17] → [[0.5, 0.1, ...], [...], [...]] → transformer
The embedding layer is the first step. Everything else – attention, feed-forward layers, output projection – operates on these vectors. Run embeddings.py to see the transition from one-hot to dense embeddings, and try changing EMBED_DIM to see how more dimensions affect similarity measurements. Pre-trained embeddings are also available via libraries like spaCy and gensim for your own projects.
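The first two arrows of that data flow can be sketched end to end. Everything here is a toy assumption: the three-word vocabulary is chosen so the IDs match the example, the embedding table is random rather than trained, and `EMBED_DIM` is a local constant that only mirrors the name used in embeddings.py, not the script itself.

```python
import numpy as np

# Hypothetical vocabulary, chosen so the IDs match the example in the text.
token_to_id = {"the": 2, "dragon": 5, "lived": 17}
VOCAB_SIZE, EMBED_DIM = 20, 8                 # toy sizes (assumption)

rng = np.random.default_rng(42)
embedding_table = rng.standard_normal((VOCAB_SIZE, EMBED_DIM)) * 0.1  # untrained

def encode(text):
    """Text -> token IDs (whitespace split stands in for a real tokenizer)."""
    return [token_to_id[token] for token in text.split()]

def embed(ids):
    """Token IDs -> one dense vector per token."""
    return embedding_table[np.array(ids)]

ids = encode("the dragon lived")
vectors = embed(ids)
print(ids)            # [2, 5, 17]
print(vectors.shape)  # (3, 8)
```

The resulting array, one row per token, is what the attention and feed-forward layers would receive as input.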
In the next lesson, we will look at the core mechanism that makes transformers work: self-attention.
