Once our transformer is trained, it outputs logits: raw, unnormalized scores for each token in the vocabulary. Text generation is the art of converting those logits into text. But how do we turn those logits into actual words? The answer is a decoding strategy, usually some form of sampling, and the choices we make dramatically affect the quality of generated text.
Text Generation: Greedy Decoding vs. Sampling
The simplest approach is greedy decoding: at every step, pick the token with the highest probability. This produces text that is locally the most “likely”, but the result is often dull and repetitive. The model can get stuck in loops: if the most likely continuation of “I like” is “ice cream”, and the most likely continuation after that is “I like” again, the output becomes “I like ice cream I like ice cream I like…”
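To make the loop concrete, here is a minimal sketch of greedy decoding. The four-word vocabulary, the `TOY_LOGITS` table, and the `toy_next_logits` function are made up for illustration and stand in for a real model's forward pass; they are not part of this lesson's code.

```python
# Hypothetical 4-token vocabulary and bigram logits, for illustration only.
VOCAB = ["I", "like", "ice", "cream"]

# Row = last token seen, column = score for each possible next token.
TOY_LOGITS = [
    [0.1, 2.0, 0.3, 0.2],  # after "I"     -> "like" scores highest
    [0.2, 0.1, 1.5, 1.2],  # after "like"  -> "ice"
    [0.1, 0.2, 0.3, 2.5],  # after "ice"   -> "cream"
    [1.8, 0.4, 0.2, 0.1],  # after "cream" -> "I"
]


def toy_next_logits(tokens):
    """Stand-in for a model forward pass: logits depend only on the last token."""
    return TOY_LOGITS[tokens[-1]]


def greedy_decode(next_logits, prompt, max_new_tokens):
    """Append the highest-scoring token at every step (argmax, no randomness)."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = next_logits(tokens)
        best = max(range(len(logits)), key=lambda i: logits[i])
        tokens.append(best)
    return tokens


output = greedy_decode(toy_next_logits, [0, 1], max_new_tokens=6)
print(" ".join(VOCAB[t] for t in output))
# prints: I like ice cream I like ice cream
```

Because the choice at each step is deterministic, the same prompt always yields the same continuation, and any cycle in the argmax transitions repeats forever.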
Text Generation Temperature
Temperature scaling divides the logits by a temperature parameter T before applying softmax: P(token) = softmax(logits / T). At T=1, the distribution is unchanged. Raising T to 1.5 flattens the distribution (more randomness, more creativity). Dropping T to 0.1 sharpens it (more deterministic, more focused), and as T approaches 0, sampling approaches greedy decoding. Higher temperatures produce more surprising text; lower temperatures stick to high-probability tokens.
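The formula softmax(logits / T) can be sketched in a few lines of plain Python. The function name `softmax_with_temperature` and the example logits are illustrative assumptions, not taken from generate.py.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / T) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


example_logits = [2.0, 1.0, 0.5, 0.1]  # made-up scores for a 4-token vocabulary
for t in (0.1, 1.0, 1.5):
    probs = softmax_with_temperature(example_logits, t)
    print(t, [round(p, 3) for p in probs])
```

Running it shows the probability of the top token shrinking as T grows, while the ranking of the tokens stays the same.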
Top-K and Top-P (Nucleus) Sampling
Top-k sampling (Fan et al., 2018) restricts sampling to the k most likely tokens, cutting off the long tail of improbable tokens. Top-p (nucleus) sampling (Holtzman et al., 2019) is more adaptive: it selects the smallest set of tokens whose cumulative probability reaches p (typically 0.9-0.95), so the candidate set shrinks when the model is confident and grows when it is not. In both cases the remaining probabilities are renormalized before sampling. Many LLM APIs expose both temperature and top-p, and the two are commonly used together.
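The two filters can be sketched as follows, operating on a list of probabilities that already sums to 1. The helper names (`renormalize`, `top_k_filter`, `top_p_filter`, `sample_token`) and the example distribution are assumptions made for this illustration; real implementations usually work on logit tensors instead.

```python
import random


def renormalize(probs, keep):
    """Zero out every token not in `keep` and rescale the rest to sum to 1."""
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    """Keep only the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)


def sample_token(probs, rng=random):
    """Draw one token index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


example_probs = [0.5, 0.3, 0.15, 0.05]  # made-up distribution
print(top_k_filter(example_probs, 2))    # only the two most likely tokens survive
print(top_p_filter(example_probs, 0.9))  # three tokens are needed to reach 0.9
print(sample_token(top_p_filter(example_probs, 0.9)))
```

With this distribution, top-k with k=2 always keeps two tokens, while top-p with p=0.9 keeps three because the first two only cover 0.8 of the probability mass.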
Run generate.py to compare these strategies side by side. Try temperature=0.8 with top_p=0.9 for creative but coherent fairy tales, or temperature=1.5 with top_k=40 for wilder outputs.
In the next lesson, we will move from our tiny model to a real pre-trained model – running GPT-2 locally and seeing what a 1.5B parameter model can do.
Repetition Penalty and Beam Search
A common problem in generation is repetition: the model gets stuck on a word or phrase. A repetition penalty discounts the logits of tokens that have already appeared in the generated sequence. In the common formulation, the logit of each previously generated token is divided by the penalty factor if it is positive and multiplied by it if it is negative, so the token becomes less likely to be chosen again. Typical values are 1.0-1.2, where 1.0 means no penalty. Beam search, an alternative to sampling, maintains multiple candidate sequences (beams) and selects the one with the highest overall probability. While beam search produces more coherent text for tasks like translation, it tends to be less creative than sampling for open-ended generation.
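Both ideas can be sketched in plain Python. The penalty below follows the divide-if-positive, multiply-if-negative convention, which is my understanding of how Hugging Face Transformers applies it; other libraries may differ. The beam search is a bare-bones version with no length normalization or end-of-sequence handling, and `TOY_PROBS` is a made-up three-token model chosen so that greedy decoding and beam search disagree.

```python
import math


def apply_repetition_penalty(logits, generated_tokens, penalty=1.1):
    """Lower the score of every token that already appeared in the output."""
    out = list(logits)
    for t in set(generated_tokens):
        # Dividing a positive logit or multiplying a negative one both reduce it.
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out


# Hypothetical next-token probabilities, keyed by the last token seen.
TOY_PROBS = {
    0: [0.01, 0.59, 0.40],
    1: [0.02, 0.50, 0.48],
    2: [0.02, 0.08, 0.90],
}


def toy_next_log_probs(tokens):
    """Stand-in for a model: log-probabilities depend only on the last token."""
    return [math.log(p) for p in TOY_PROBS[tokens[-1]]]


def beam_search(next_log_probs, prompt, num_beams, max_new_tokens):
    """Keep the num_beams highest-scoring sequences at every step."""
    beams = [(0.0, list(prompt))]  # (cumulative log-probability, tokens)
    for _ in range(max_new_tokens):
        candidates = []
        for score, tokens in beams:
            for tok, log_p in enumerate(next_log_probs(tokens)):
                candidates.append((score + log_p, tokens + [tok]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:num_beams]
    return beams[0][1]


print(apply_repetition_penalty([2.0, -1.0, 0.5], [0, 1], penalty=2.0))
print(beam_search(toy_next_log_probs, [0], num_beams=1, max_new_tokens=2))
print(beam_search(toy_next_log_probs, [0], num_beams=2, max_new_tokens=2))
```

With a single beam the search reduces to greedy decoding and commits to token 1 first (sequence probability 0.59 × 0.50 = 0.295). With two beams it also keeps the runner-up and finds the sequence through token 2, which has a higher overall probability (0.40 × 0.90 = 0.36).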
The HuggingFace generation documentation provides an excellent interactive comparison of all these strategies. Experiment with different combinations to develop intuition for how each parameter affects output quality.
Practical Generation Tips
Choosing the right generation strategy depends on your use case. For tasks where faithfulness matters (factual answers, summarization, translation), use lower temperatures (0.1-0.3) with top-p=0.9 and a repetition penalty of 1.1. For creative writing or brainstorming, use higher temperatures (0.8-1.2) with top-k=50 and top-p=0.95. Chatbots typically use a middle ground: temperature=0.7 with top-p=0.9. The HuggingFace Transformers .generate() method accepts all these parameters directly; note that temperature, top_k, and top_p only take effect when do_sample=True. Always set max_new_tokens to limit response length, and set pad_token_id=eos_token_id for models such as GPT-2 that define no padding token.
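One way to keep these recommendations in one place is a small table of presets. The `PRESETS` dictionary and `generation_kwargs` helper below are hypothetical; where the text gives a range, the sketch picks a value inside it (0.2 and 1.0), and the `max_new_tokens` default of 100 is an arbitrary placeholder. The keys are chosen to match keyword arguments that `.generate()` accepts.

```python
# Hypothetical presets mirroring the recommendations above.
PRESETS = {
    "factual": {"temperature": 0.2, "top_p": 0.9, "repetition_penalty": 1.1},
    "creative": {"temperature": 1.0, "top_k": 50, "top_p": 0.95},
    "chat": {"temperature": 0.7, "top_p": 0.9},
}


def generation_kwargs(use_case, max_new_tokens=100):
    """Build keyword arguments for a sampling-based generate() call."""
    if use_case not in PRESETS:
        raise KeyError(f"unknown use case: {use_case!r}")
    kwargs = dict(PRESETS[use_case])
    kwargs["do_sample"] = True  # sampling parameters are ignored without this
    kwargs["max_new_tokens"] = max_new_tokens
    return kwargs


print(generation_kwargs("chat"))

# Assumed usage with a loaded Hugging Face model and tokenizer (not run here):
# output_ids = model.generate(
#     **inputs,
#     **generation_kwargs("chat"),
#     pad_token_id=tokenizer.eos_token_id,
# )
```

Keeping the presets separate from the call makes it easy to change one parameter at a time and compare outputs.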
Understanding sampling strategies is essential for anyone working with LLMs. The difference between a model that produces engaging, coherent text versus one that rambles or repeats often comes down to these parameters. Experiment systematically: start with temperature=0.8, top-p=0.9, and repetition_penalty=1.1, then adjust one parameter at a time to understand its effect.
