Temperature and top-p: how an AI actually picks its next word

Here is something most people get wrong about language models: the model does not decide what to say next. Not directly. At every step, all it produces is a giant list of odds - a probability for every possible next word-piece in its vocabulary. The word "cat" might get a high number, "dog" a slightly lower one, "refrigerator" a tiny one, and so on across tens of thousands of options. (Those options are word-pieces, not whole words - see tokenization for why.) The model hands you this list of odds, and then something else has to actually pick one. That picking step is called sampling or decoding, and the rules you set for it are one of the biggest reasons an AI can feel dull and robotic one moment and inventive or unhinged the next.

Start with the simplest possible rule: always pick the single most likely word. This is called greedy decoding, and it sounds smart - why wouldn't you always take the best bet? The problem is that always taking the safest word produces flat, repetitive, lifeless text. It gets stuck in loops, repeats phrases, and reads like a form letter. The landmark paper on this, The Curious Case of Neural Text Degeneration, showed that human writing does not actually follow the most-probable path - real language is full of slightly surprising word choices, and a model that never surprises sounds inhuman. So instead of always grabbing the top word, we roll dice weighted by the odds. The high-probability words come up most often, but the model occasionally reaches for a less obvious choice, and that little bit of randomness is what makes the text feel alive.
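To make the difference concrete, here is a minimal sketch in Python of the two rules just described. The words and their odds are invented for illustration and do not come from any real model, the function names are hypothetical, and the fixed seed is only there so the demo repeats.

```python
import random

# Toy odds for the next word-piece (invented numbers, not from a real model).
probs = {"cat": 0.55, "dog": 0.30, "fish": 0.149, "refrigerator": 0.001}

def greedy_pick(probs):
    """Greedy decoding: always return the single most likely option."""
    return max(probs, key=probs.get)

def sample_pick(probs, rng):
    """Sampling: roll dice weighted by the odds."""
    words = list(probs)
    weights = [probs[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed only so the demo is repeatable
greedy_draws = [greedy_pick(probs) for _ in range(1000)]
sampled_draws = [sample_pick(probs, rng) for _ in range(1000)]

# Greedy never leaves the top word; sampling spreads out roughly in line with the odds.
print(set(greedy_draws))
print({w: sampled_draws.count(w) for w in probs})
```

Run it and the greedy list contains nothing but "cat", while the sampled list is mostly "cat" with a healthy share of "dog" and "fish" mixed in.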

The main knob for controlling that dice roll is temperature. Picture the model's list of odds as a landscape of hills, the tallest hills being the likeliest words. Temperature reshapes that landscape before the roll. Turn temperature down toward zero and the tallest hill grows into a mountain that dominates everything - the model almost always takes the single most likely word, giving you consistent, predictable, safe output (great for factual answers or code). Turn temperature up and you flatten the landscape - the tall hills shrink and the little ones rise, so unlikely words get a real chance, giving you creative, varied, sometimes chaotic output (good for brainstorming, bad for accuracy). Crank it too high and the text dissolves into nonsense, because you have made "refrigerator" nearly as likely as "cat." A useful mental model: low temperature is a careful, cautious writer; high temperature is a caffeinated improviser.
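Here is a rough sketch of that reshaping in Python. It uses one common formulation: raise each probability to the power 1/temperature and renormalize, which is equivalent to dividing the model's raw scores by the temperature before turning them into probabilities. The toy odds and the function name apply_temperature are assumptions made for illustration.

```python
def apply_temperature(probs, temperature):
    """Reshape the odds: raise each to the power 1/temperature, then renormalize."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = {w: p ** (1.0 / temperature) for w, p in probs.items()}
    total = sum(scaled.values())
    return {w: s / total for w, s in scaled.items()}

# Toy odds (invented numbers, not from a real model).
probs = {"cat": 0.55, "dog": 0.30, "fish": 0.149, "refrigerator": 0.001}

# Low temperature sharpens the landscape, high temperature flattens it,
# and a temperature of 1 leaves the odds unchanged.
for t in (0.2, 1.0, 2.0, 10.0):
    reshaped = apply_temperature(probs, t)
    print(t, {w: round(p, 3) for w, p in reshaped.items()})
```

Watch the "refrigerator" column as the temperature climbs: it starts out negligible and ends up as a real contender, which is exactly the slide into nonsense described above.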

But temperature alone has a flaw. Even after reshaping, there are thousands of absurd words sitting in the list with tiny odds, and if you roll enough times, one of those garbage words eventually comes up and derails the whole sentence. So we add a second knob that cuts off the long tail of bad options entirely. The most popular version is top-p, also called nucleus sampling, and it is the other big idea from the Neural Text Degeneration paper. Top-p works like this: line up the candidate words from most to least likely, then keep adding words to the shortlist until their combined odds cross a threshold you set - say ninety percent - and throw away everything else. The model then only rolls dice among that shortlist. The clever part is that the shortlist grows and shrinks automatically. When the model is confident (only a couple of words make sense), the shortlist is tiny. When it is genuinely uncertain (many words would work), the shortlist is large. An older, simpler cousin called top-k does the same job but with a fixed shortlist length - always the top forty words, say - which is cruder because it can't adapt to how confident the model is; it was introduced earlier, in the paper Hierarchical Neural Story Generation. In practice, people often use temperature and top-p together: top-p removes the truly bad options, and temperature tunes how boldly the model chooses among the good ones.
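The shortlist idea is easy to sketch in Python. The two functions below are illustrative helpers with hypothetical names, and the example odds are invented; the point is only to show the top-p shortlist changing size while the top-k shortlist stays fixed.

```python
def top_p_shortlist(probs, p=0.9):
    """Keep the likeliest options until their combined odds reach p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, running = [], 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        running += prob
        if running >= p:
            break
    total = sum(prob for _, prob in kept)
    return {word: prob / total for word, prob in kept}

def top_k_shortlist(probs, k=40):
    """Keep a fixed number of the likeliest options, then renormalize."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(prob for _, prob in ranked)
    return {word: prob / total for word, prob in ranked}

# Invented odds: one case where the model is confident, one where it is not.
confident = {"Paris": 0.97, "Lyon": 0.02, "Nice": 0.01}
uncertain = {"red": 0.2, "blue": 0.2, "green": 0.2, "gold": 0.2, "grey": 0.2}

print(len(top_p_shortlist(confident)), len(top_p_shortlist(uncertain)))
print(len(top_k_shortlist(confident, k=2)), len(top_k_shortlist(uncertain, k=2)))
```

With a ninety percent threshold, the confident case collapses to a single word while the uncertain case keeps all five. Top-k with k set to two keeps two words in both cases, whether or not that suits the situation.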

Why does any of this matter to you? Because it explains behavior you have definitely seen. Ask the same model the same question twice and get two different answers? That is sampling - unless randomness is turned fully off, each run draws differently and can branch off in a new direction. Notice that a chatbot writing code is rock-steady but the same chatbot writing a poem is wildly varied? A likely explanation is that the people running it set a lower temperature for code and a higher one for creative writing. And here is the uncomfortable connection: the same randomness that makes AI writing feel natural is also entangled with why it makes things up. A model rolling dice among plausible-sounding words has no built-in sense of which ones are true - it only knows which ones are likely, and likely is not the same as correct.
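The "same question, different answers" effect shows up even in a toy setting. The sketch below uses a made-up lookup table in place of a real model (TOY_MODEL and generate are hypothetical names, and the odds are invented), and it uses random seeds purely as a way to stand in for separate runs.

```python
import random

# Hypothetical toy "model": for each current word, the odds of each next word.
TOY_MODEL = {
    "<start>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.6, "dog": 0.4},
    "a": {"cat": 0.5, "bird": 0.5},
    "cat": {"sleeps": 0.7, "runs": 0.3},
    "dog": {"runs": 0.6, "sleeps": 0.4},
    "bird": {"sings": 0.9, "sleeps": 0.1},
}

def generate(rng=None, steps=3):
    """Build a phrase word by word; rng=None means randomness is switched off."""
    word, output = "<start>", []
    for _ in range(steps):
        odds = TOY_MODEL[word]
        if rng is None:
            word = max(odds, key=odds.get)  # always the likeliest next word
        else:
            word = rng.choices(list(odds), weights=list(odds.values()), k=1)[0]
        output.append(word)
    return " ".join(output)

print(generate(), "|", generate())  # identical every time
for seed in range(5):               # each seed stands in for a fresh run
    print(generate(random.Random(seed)))
```

Notice how an early draw steers everything after it: once a run picks "a" instead of "the", "dog" is no longer reachable. That branching is why two runs can end up in quite different places.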

The deeper point is that a language model is fundamentally a probability machine, not a decision machine. All the intelligence lives in that list of odds it produces at each step; sampling is just the humble ritual of turning odds into an actual word. But that humble ritual is a control panel, and once you know the two main dials - temperature for how much it gambles, top-p for how big a pool it gambles from - a lot of an AI's personality stops being mysterious and starts being something you can adjust. Speed tricks like speculative decoding sit on top of this same word-by-word process, making it faster without, in their standard form, changing the odds each word is drawn from.
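Putting the two dials together, one step of the picking ritual might look like the sketch below. It makes a few assumptions worth flagging: temperature is applied before top-p (real systems may order or combine these differently), a temperature of zero is treated as greedy decoding, and the function name sample_next and the toy odds are invented for illustration.

```python
import random

def sample_next(probs, temperature=0.8, top_p=0.9, rng=random):
    """Pick one next word: reshape with temperature, trim with top-p, then roll."""
    # Assumption: treat temperature 0 as "randomness off" and take the top word.
    if temperature <= 0:
        return max(probs, key=probs.get)

    # Dial 1: temperature reshapes the odds (power 1/temperature, renormalized).
    scaled = {w: p ** (1.0 / temperature) for w, p in probs.items()}
    total = sum(scaled.values())
    reshaped = {w: s / total for w, s in scaled.items()}

    # Dial 2: top-p keeps the likeliest words until their combined odds reach top_p.
    ranked = sorted(reshaped.items(), key=lambda item: item[1], reverse=True)
    shortlist, running = [], 0.0
    for word, prob in ranked:
        shortlist.append((word, prob))
        running += prob
        if running >= top_p:
            break

    # The roll: weighted dice over the shortlist only.
    words = [word for word, _ in shortlist]
    weights = [prob for _, prob in shortlist]
    return rng.choices(words, weights=weights, k=1)[0]

# Toy odds (invented numbers, not from a real model).
probs = {"cat": 0.55, "dog": 0.30, "fish": 0.149, "refrigerator": 0.001}

rng = random.Random(0)  # fixed seed only so the demo is repeatable
draws = [sample_next(probs, rng=rng) for _ in range(2000)]
print({w: draws.count(w) for w in probs})
```

With these settings "refrigerator" never makes the shortlist, so it cannot be drawn no matter how many times you roll, while the three plausible words still take turns.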

Key papers
The Curious Case of Neural Text Degeneration (Holtzman et al., 2020) - introduces nucleus / top-p sampling
Hierarchical Neural Story Generation (Fan et al., 2018) - introduces top-k sampling

Key questions

What does temperature do to an AI's output?

Temperature controls how much the model gambles on less likely words - low temperature makes it pick safe, predictable words and sound consistent, while high temperature makes it take more risks and sound creative or erratic. It reshapes the odds before the model draws its next word.

What is the difference between temperature and top-p?

Temperature rescales how confident the model is across all possible next words, while top-p limits the pool to only the most likely words that together cover a set share of the probability. They are two different knobs that both control randomness, and they are often used together.

Why does the same prompt give different answers each time?

Because the model does not choose the single best word - it draws randomly from a list of weighted options, so unless randomness is switched fully off, each run can land on a different word and spin off in a new direction.

Cite this

APA

Ground Truth. (2026, June 30). Temperature and top-p: how an AI actually picks its next word. Ground Truth. https://groundtruth.day/learn/how-ai-picks-its-next-word.html

BibTeX

@misc{groundtruth:how-ai-picks-its-next-word,
  title  = {Temperature and top-p: how an AI actually picks its next word},
  author = {{Ground Truth}},
  year   = {2026},
  month  = {jun},
  url    = {https://groundtruth.day/learn/how-ai-picks-its-next-word.html}
}

Topics: sampling · temperature · decoding · fundamentals · inference