Tokenization - How LLMs See Text

In the previous lesson, we built a language model that predicted whole words. But there is a problem: a model that only knows whole words cannot handle a word it has never seen. Enter subword tokenization – the process of splitting text into pieces small enough that any input can be represented.

The Problem with Words

If your model has a vocabulary of 50,000 words, what happens when someone types “flabbergasted” and the vocabulary only contains “flabbergast”? With word-level tokens, you are stuck. But if your vocabulary includes subword pieces like flab, ber, gast, ed, you can assemble the word from those parts – even though the model never saw it whole. This insight, popularized by Sennrich et al. (2016) in their ACL paper on neural machine translation, transformed how NLP systems handle open vocabularies.
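To make the idea concrete, here is a minimal sketch of assembling a word from subword pieces. It uses greedy longest-match splitting, which is a simplification chosen for illustration and not how BPE itself encodes text. The function name segment and the toy vocabulary are assumptions made up for this example.

```python
def segment(word, vocab):
    """Greedily split `word` into the longest pieces found in `vocab`."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, then shrink until one matches.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No piece in the vocabulary starts at position i.
            raise ValueError(f"cannot represent {word[i]!r} with this vocabulary")
    return pieces


# Hypothetical toy vocabulary, not taken from any real tokenizer.
toy_vocab = {"flab", "ber", "gast", "ed", "s"}

print(segment("flabbergasted", toy_vocab))
print(segment("flabbergasts", toy_vocab))
```

Neither full word is in the vocabulary, yet both can be written as a sequence of known pieces. A word containing a character that no piece covers would still fail here, which is one reason real tokenizers keep every single character (or byte) in the vocabulary as a fallback.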

Byte-Pair Encoding (BPE)

BPE, originally a data compression algorithm by Gage (1994), was adapted for tokenization by Sennrich and colleagues. It is elegant in its simplicity:

1. Start with individual characters as your tokens
2. Count every adjacent pair of tokens in your training text
3. Find the most frequent pair and merge it into a new token
4. Repeat until you reach your desired vocabulary size
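The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the lesson's bpe.py and not a production implementation: the name train_bpe is an assumption, and the sketch treats the text as one continuous stream of characters, whereas real tokenizers usually split on words first so that merges do not cross word boundaries.

```python
from collections import Counter


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges from `text`; return (merges, tokens)."""
    tokens = list(text)  # start with individual characters
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of tokens.
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        # Find the most frequent pair.
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so further merges gain nothing
        merges.append((a, b))
        # Merge every occurrence of the pair, scanning left to right.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens


merges, tokens = train_bpe("aaabdaaabac", num_merges=3)
print(merges)
print(tokens)
```

Each merge adds exactly one new token to the vocabulary, so the final vocabulary size is the number of starting symbols plus the number of merges. The ordered merges list is the tokenizer's learned state: encoding new text means replaying those merges in the same order.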

The algorithm discovers its own vocabulary automatically. Common words become single tokens, while rare words stay split into smaller pieces. Modern implementations are available in the HuggingFace tokenizers library, which provides highly optimized BPE training in Rust.

BPE in Practice: GPT-4’s Tokenizer

OpenAI’s GPT-4 uses a BPE tokenizer called cl100k_base, accessible via the tiktoken library. It has a vocabulary of about 100,000 tokens (100,256 learned entries, plus a handful of special tokens) and operates on bytes rather than characters, enabling it to handle any Unicode string. This byte-level approach, introduced by Radford et al. (2019) in the GPT-2 paper, ensures that even emojis and non-Latin scripts are handled gracefully. You can explore how any string is tokenized using OpenAI’s online tokenizer.
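To see why working on bytes covers every possible input, here is a small sketch using only the Python standard library. It does not call tiktoken; the helper names to_byte_tokens and from_byte_tokens are assumptions for this example. It shows only the starting point of a byte-level tokenizer, before any merges are applied.

```python
def to_byte_tokens(text):
    """Return the UTF-8 bytes of `text` as a list of integers in 0..255."""
    return list(text.encode("utf-8"))


def from_byte_tokens(byte_tokens):
    """Invert to_byte_tokens: turn a list of byte values back into a string."""
    return bytes(byte_tokens).decode("utf-8")


# "naïve" followed by a dragon emoji, written with escapes to avoid ambiguity.
sample = "na\u00efve \U0001F409"
byte_tokens = to_byte_tokens(sample)

print(len(sample), "characters ->", len(byte_tokens), "bytes")
print(byte_tokens)
print(from_byte_tokens(byte_tokens) == sample)
```

Whatever the script or symbol, the base alphabet never grows beyond 256 values, so no input is ever out of vocabulary. The price is that characters outside ASCII start out as several tokens each until merges combine them.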

Seeing It In Action

When we train BPE on our fairy tale corpus, the first merge is always the most frequent character pair. After 30 merges, our vocabulary grows from individual characters to 40 tokens. The word “princess” becomes a single token (it appears often), while “dragon” stays split into d, r, a, g, on (it appears less frequently).
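The sketch below shows how a learned merge list produces this kind of split when it is applied to a word. The merge list is hand-written and hypothetical: it is not the output of the lesson's bpe.py on the fairy tale corpus, only a small set chosen to reproduce the princess and dragon behavior described above. The function name encode is likewise an assumption.

```python
def encode(word, merges):
    """Split `word` into characters, then apply `merges` in the order learned."""
    tokens = list(word)
    for a, b in merges:
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)  # fuse the pair into one token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


# Hypothetical merges, listed in the order they would have been learned.
toy_merges = [
    ("o", "n"),
    ("s", "s"),
    ("e", "ss"),
    ("c", "ess"),
    ("p", "r"),
    ("i", "n"),
    ("pr", "in"),
    ("prin", "cess"),
]

print(encode("princess", toy_merges))
print(encode("dragon", toy_merges))
```

With this merge list, enough merges build on each other to fuse "princess" into one token, while "dragon" only benefits from the single o + n merge. Cutting the list short has the same effect as lowering num_merges: frequent words fall apart into smaller pieces again.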

Run bpe.py to see this in action. Try changing num_merges to observe how the vocabulary evolves. You can also compare your results against the HuggingFace tokenizer training tutorial for a production-grade implementation.

Tokenization Artifacts

Tokenization is not invisible – it leaves artifacts everywhere. Ask ChatGPT how many letters are in “strawberry” and it might say 9 instead of 10, because the model receives a few subword tokens (something like straw + berry) rather than ten separate letters. Tokenization also affects cost and quality: some languages are tokenized more efficiently than others (English averages ~1.3 tokens per word, while Swahili averages ~2.5), so text of the same length can use up more of the context window and cost more in one language than in another. The tiktokenizer web app lets you visualize exactly how any text is split by different models.
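A tiny sketch makes the letter-counting problem easier to see. The split into straw + berry and the ID numbers are made up for illustration; a real tokenizer may split the word differently and will use different IDs.

```python
word = "strawberry"

# Hypothetical subword split and made-up token IDs.
pieces = ["straw", "berry"]
toy_ids = {"straw": 1001, "berry": 1002}

# What ordinary code sees: a string whose characters can be counted directly.
print(len(word))

# What the model receives: a short list of integers, one per token.
model_input = [toy_ids[piece] for piece in pieces]
print(model_input)
```

Nothing in the list of IDs says how many letters each token contains. The model can only answer letter-level questions if it has learned, from training data, what each token happens to spell.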

Why This Matters

Tokenization is the first real step in understanding LLMs. Everything else builds on it: embeddings turn these tokens into vectors, the attention mechanism processes those vectors, and the transformer generates new tokens one at a time. The vocabulary size and merge strategy directly affect model capacity, inference speed, and the kinds of patterns the model can learn.

And when you see ChatGPT claiming a word has “5 letters” when it has 6, that is because the model does not see letters; it sees tokens. Tokenization artifacts are visible everywhere once you know what to look for. For a deep dive, consult the HuggingFace tokenizers documentation or the original subword-nmt toolkit.

In the next lesson, we will look at embeddings – how to turn tokens into numbers that a model can compute with.
