If you have ever wondered how ChatGPT works under the hood, you have come to the right place. Let us strip away the hype and build a language model from absolute zero — using nothing but Python and a frequency table. There are no neural networks, no GPUs, and no math beyond counting.
What is a Language Model?
A language model is just a system that predicts what word comes next. That is it. That is the entire idea.
If I show you “The princess lived in a ___”, you would probably guess “castle” (or “cave” if you are a dragon fan). Your brain learned this pattern from years of reading, listening, and speaking. A language model does the same thing, but with math and data instead of a human brain.
The Simplest Possible Model: Bigram
A “bigram” model looks at the last word and asks: what word have I seen follow this word before?
Here is how it works in three steps:
- Read a bunch of text
- Count how often each word follows every other word
- Pick the next word, given the current one, based on those counts
No neural networks. No matrices. No GPUs. Pure counting. The idea is old: Claude Shannon used word-pair statistics like these to generate English-like text in his 1948 paper on information theory.
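To make the three steps concrete, here is a minimal sketch that traces them on a single sentence. The sentence and the variable names are made up for this illustration and are not part of the lesson's corpus.

```python
from collections import Counter

# Step 1: read some text (a made-up sentence, used only for illustration).
text = "the princess saw the dragon and the princess ran"
tokens = text.split()

# Step 2: count how often each (word, next word) pair occurs.
pair_counts = Counter(zip(tokens, tokens[1:]))

# Step 3: given a word, look up which words followed it and how often.
followers_of_the = {nxt: n for (cur, nxt), n in pair_counts.items() if cur == "the"}
print(followers_of_the)  # {'princess': 2, 'dragon': 1}
```

Picking a next word for “the” then means choosing “princess” about twice as often as “dragon”.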
The Code
The entire model fits in about 15 lines of Python. Download it from the LLM Zero to Hero repository and follow along:
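from collections import defaultdict, Counter
import random


def build_bigram_model(tokens):
    # Map each word to a Counter of the words seen immediately after it.
    counts = defaultdict(Counter)
    for current, next_token in zip(tokens, tokens[1:]):
        counts[current][next_token] += 1
    return counts


def generate(model, seed, length=20):
    tokens = [seed]
    for _ in range(length - 1):
        # .get() avoids inserting new keys into the defaultdict.
        candidates = model.get(tokens[-1])
        if not candidates:
            break  # Dead end: this word was never followed by anything.
        words, weights = zip(*candidates.items())
        # Sample the next word in proportion to how often it followed.
        next_word = random.choices(words, weights=weights, k=1)[0]
        tokens.append(next_word)
    return ' '.join(tokens)

Training on a Fairy Tale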
Let us train on a tiny corpus — a fairy tale with princesses, dragons, and knights. After counting, the model learns that after “the”, the next word is most often “princess” (seen 4 times), followed by “dragon” (3 times) and “knight” (3 times).
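The lesson's fairy tale is not reproduced here, so the sketch below trains on a short stand-in corpus (an assumption, written only for this example) to show one way to inspect what the model has counted. Its numbers will differ from the 4/3/3 figures quoted above.

```python
from collections import defaultdict, Counter


def build_bigram_model(tokens):
    counts = defaultdict(Counter)
    for current, next_token in zip(tokens, tokens[1:]):
        counts[current][next_token] += 1
    return counts


# Stand-in corpus (hypothetical, not the lesson's fairy tale).
corpus = (
    "the princess lived in a castle . "
    "the dragon lived in a cave . "
    "the knight fought the dragon . "
    "the princess thanked the knight . "
    "the princess smiled ."
)
tokens = corpus.split()  # naive whitespace tokenization
model = build_bigram_model(tokens)

# Followers of "the", highest count first.
print(model["the"].most_common())
# [('princess', 3), ('dragon', 2), ('knight', 2)]
```

Each entry in the model is a plain Counter, so `most_common()` is a quick way to see which continuations dominate for any word.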
When generating, we pick based on these probabilities. If “the” was followed by “princess” 4 out of 10 times, then 40% of the time we will pick “princess”, 30% “dragon”, 30% “knight”.
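The arithmetic can be checked directly. This sketch takes the counts quoted for “the” (4, 3, 3), turns them into probabilities, and then samples many times to confirm the observed frequencies land near 40/30/30. The variable names, the seed, and the sample size are illustrative choices.

```python
import random
from collections import Counter

# Counts quoted in the text for words following "the".
followers = Counter({"princess": 4, "dragon": 3, "knight": 3})

total = sum(followers.values())
probabilities = {word: count / total for word, count in followers.items()}
print(probabilities)  # {'princess': 0.4, 'dragon': 0.3, 'knight': 0.3}

# random.choices accepts raw counts as weights, so normalizing is optional.
rng = random.Random(0)  # fixed seed so the run is repeatable
words, weights = zip(*followers.items())
samples = rng.choices(words, weights=weights, k=10_000)

empirical = Counter(samples)
for word in words:
    print(word, round(empirical[word] / len(samples), 2))
```

Because sampling is random, the model will sometimes pick “dragon” or “knight” even though “princess” is the single most likely choice. That randomness is what keeps the output from being identical on every run.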
What Comes Out
With our tiny dataset, generation looks like: “the knight fought the dragon lived in a treasure the dragon”. It is repetitive and weird, but it is English-like. The model picked up local word order from just 9 sentences, yet it remembers nothing beyond the previous word, which is why separate phrases run into each other. Try running bigram.py with your own text to see how it adapts.
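One way to see why the output is locally plausible but globally odd is to check it pair by pair. The sketch below uses a stand-in training text (hypothetical, not the lesson's corpus) and a variant of `generate` that returns a list of tokens and stops when a word has no known follower. Both are assumptions made for this example.

```python
import random
from collections import defaultdict, Counter


def build_bigram_model(tokens):
    counts = defaultdict(Counter)
    for current, next_token in zip(tokens, tokens[1:]):
        counts[current][next_token] += 1
    return counts


def generate(model, seed, length=20):
    tokens = [seed]
    for _ in range(length - 1):
        candidates = model.get(tokens[-1])
        if not candidates:  # no known follower: stop early
            break
        words, weights = zip(*candidates.items())
        tokens.append(random.choices(words, weights=weights, k=1)[0])
    return tokens


# Stand-in training text (hypothetical).
training = (
    "the knight fought the dragon the dragon lived in a cave "
    "the princess found a treasure"
).split()
model = build_bigram_model(training)

random.seed(1)  # repeatable output for this demo
output = generate(model, "the", length=12)
print(" ".join(output))

# Every adjacent pair in the output was seen during training,
# even if the full sequence never appeared anywhere in the text.
seen_pairs = set(zip(training, training[1:]))
print(all(pair in seen_pairs for pair in zip(output, output[1:])))  # True
```

Each step is a pair the model has seen, so short stretches read naturally. Nothing ties the steps together, so the sentence as a whole can wander.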
Why This Matters
This bigram model shares its core objective with ChatGPT: predict the next token. Every modern LLM, at its heart, is still doing exactly that. The differences are how much context the model uses (thousands of tokens instead of one word) and how complex the predictor is (a transformer neural network with billions of learned parameters instead of a frequency table).
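The effect of context can be previewed without leaving the frequency table. The sketch below generalizes the counting to a context of several words. `build_ngram_model` and its `context_size` parameter are hypothetical names introduced for this example, and this is not how a transformer works internally. It only shows what extra context buys.

```python
from collections import defaultdict, Counter


def build_ngram_model(tokens, context_size):
    """Count which token follows each run of `context_size` tokens."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - context_size):
        context = tuple(tokens[i:i + context_size])
        counts[context][tokens[i + context_size]] += 1
    return counts


tokens = "the dragon lived in a cave . the princess lived in a castle .".split()

one_word = build_ngram_model(tokens, context_size=1)
four_words = build_ngram_model(tokens, context_size=4)

# With one word of context, "a" is a coin flip between the two endings.
print(dict(one_word[("a",)]))  # {'cave': 1, 'castle': 1}

# With four words of context, the subject of the sentence decides it.
print(dict(four_words[("dragon", "lived", "in", "a")]))    # {'cave': 1}
print(dict(four_words[("princess", "lived", "in", "a")]))  # {'castle': 1}
```

The catch is that longer contexts appear less often in the training text, so a plain table quickly runs out of examples. That limitation is one reason modern models replace the table with a learned predictor.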
In the next lesson, we will look at tokenization — how do we split text into pieces small enough that the model can handle any possible word?
