Test-Time Compute: Spending More Thinking at the Moment You Ask
Test-time compute is the idea of making an AI smarter not by retraining it, but by letting it do more work at the moment you ask a question. Instead of taking the model's first, fastest answer, you let it think for longer, or generate many answers and select the best one. It is one of the biggest levers in modern AI: the same fixed model can go from mediocre to strong on hard problems just by being allowed to spend more effort per question.
Key facts
- Test-time compute is spent at inference (when you ask), not during training. The model's weights never change.
- The two main forms are thinking longer (a long chain of reasoning before answering) and sampling many answers (then voting or verifying).
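- Snell and colleagues at Google DeepMind and Berkeley (2024) showed that spending compute at test time can beat spending the same compute on a bigger model, mainly on problems where the smaller model already has some chance of success. On the hardest problems, a bigger model can still be the better investment.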
- The approach has a ceiling: selection methods like majority voting saturate as you add samples, while coverage (the chance that at least one sample is correct) keeps improving.
For most of deep learning's history, the way to get a better answer was to train a bigger model on more data. Test-time compute flips that. It says: keep the model fixed, and instead invest effort at the moment of the question. This matters because thinking is often much cheaper than training, and you can dial it up only for the hard questions that need it.
The first form is simply thinking longer. When a model works through a problem step by step before committing to an answer (see our lesson on chain-of-thought reasoning), those intermediate steps are test-time compute. Reasoning models are trained to produce long internal reasoning traces precisely so they can spend more thought on demand. This is also why a truncated reasoning trace is dangerous: if the serving system cuts the thinking short, the model answers before it has finished working, as reported in the case of GPT-5.5 Codex clamping its reasoning at 516 tokens.
The second form is sampling many answers. Because a language model picks its next token from a probability distribution, running it several times on the same prompt gives you several different attempts. A useful analogy: asking one expert once versus asking a room of experts and taking a show of hands. If you take the most common answer, that is called self-consistency or majority voting, introduced by Wang and colleagues in 2022. If instead you keep every attempt and use a separate checker to pick out a correct one, that is best-of-N selection with a verifier. How often at least one attempt in the pile is correct is called coverage, and it was studied at large scale in the aptly named Large Language Monkeys paper (a nod to the infinite-monkeys idea that enough random tries eventually produce the right output).
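The two selection strategies above can be sketched in a few lines of Python. This is a minimal illustration, not code from the cited papers: `toy_model`, its answer distribution, and the lambda used as a verifier are hypothetical stand-ins for a real model call and a real checker.

```python
import random
from collections import Counter
from typing import Callable, List, Optional


def toy_model(rng: random.Random) -> str:
    """Hypothetical stand-in for one sampled model answer (not a real API)."""
    return rng.choices(["42", "41", "24"], weights=[0.5, 0.3, 0.2])[0]


def majority_vote(answers: List[str]) -> str:
    """Self-consistency: return the most common final answer."""
    if not answers:
        raise ValueError("need at least one answer")
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers: List[str], verifier: Callable[[str], bool]) -> Optional[str]:
    """Return the first answer the verifier accepts, or None if none pass."""
    for answer in answers:
        if verifier(answer):
            return answer
    return None


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so the run is repeatable
    attempts = [toy_model(rng) for _ in range(15)]
    print("attempts: ", attempts)
    print("majority: ", majority_vote(attempts))
    # Assumed verifier: here we pretend we can check that "42" is correct.
    print("best-of-N:", best_of_n(attempts, lambda a: a == "42"))
```

Note the different requirements: `majority_vote` needs no checker at all, while `best_of_n` is only as good as the verifier you hand it.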
Here is the crucial subtlety. Voting and verifying behave very differently as you add samples. Majority voting plateaus quickly, because the many tries at one problem are correlated: they tend to make the same mistakes together, so a hundred tries are not a hundred independent opinions. Where the model's most common answer is simply wrong, extra samples only make the vote settle more firmly on that wrong answer. Coverage is different: if you only need one correct answer somewhere in the pile for a verifier to pick out, more tries keep helping for a long time. The practical takeaway, sharpened in the July 2026 result on why more samples stop helping, is that the bottleneck is usually recognizing a right answer, not generating one. A good verifier is often worth more than a bigger sample budget.
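The gap between voting and coverage can be made concrete with a toy calculation. The sketch below assumes a deliberately simple model that is not taken from the cited papers: each question has a fixed per-sample success probability `p`, samples are independent given the question, and every wrong sample lands on the same wrong answer. The `p` values are invented for illustration.

```python
from math import comb


def coverage(p: float, k: int) -> float:
    """Chance that at least one of k samples is correct."""
    return 1.0 - (1.0 - p) ** k


def majority_accuracy(p: float, k: int) -> float:
    """Chance the correct answer wins a strict majority of k samples.

    Assumes k is odd and all wrong samples share one wrong answer.
    """
    if k % 2 == 0:
        raise ValueError("use an odd k so there are no ties")
    need = k // 2 + 1
    return sum(comb(k, i) * p**i * (1 - p) ** (k - i) for i in range(need, k + 1))


if __name__ == "__main__":
    # Hypothetical mix of easy and hard questions (per-sample success rates).
    question_ps = [0.9, 0.6, 0.3, 0.05]
    print(f"{'k':>4} {'majority':>9} {'coverage':>9}")
    for k in (1, 5, 25, 101):
        maj = sum(majority_accuracy(p, k) for p in question_ps) / len(question_ps)
        cov = sum(coverage(p, k) for p in question_ps) / len(question_ps)
        print(f"{k:>4} {maj:>9.3f} {cov:>9.3f}")
```

In this toy setup, questions with `p` below 0.5 pull voting accuracy toward zero as `k` grows, while coverage on those same questions keeps climbing toward one. Real models are messier, but the direction of the effect matches the argument above.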
Why does this matter to anyone building with AI? Because test-time compute is a knob you control at runtime. You can spend a little on easy questions and a lot on hard ones, trade latency and cost for accuracy, and get better results from a model you cannot retrain. It also connects to how reasoning models are trained in the first place, via reinforcement learning post-training that rewards good long reasoning.
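One way to picture the runtime knob is an early-stopping vote: keep sampling until the leading answer has a clear share of the votes, then stop. The sketch below is an illustrative assumption, not a method described in this article's sources. The `generate` callable, the thresholds, and the two toy generators are all hypothetical.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, Tuple


def adaptive_vote(
    generate: Callable[[], str],
    min_samples: int = 3,
    max_samples: int = 20,
    agreement: float = 0.8,
) -> Tuple[str, int]:
    """Sample until the top answer's vote share reaches `agreement`.

    Returns (answer, samples_used). Stops at max_samples if the
    answers never agree strongly enough.
    """
    if max_samples < 1:
        raise ValueError("max_samples must be at least 1")
    counts: Counter = Counter()
    answer = ""
    for n in range(1, max_samples + 1):
        counts[generate()] += 1
        answer, top = counts.most_common(1)[0]
        if n >= min_samples and top / n >= agreement:
            return answer, n  # confident enough: stop spending compute
    return answer, max_samples


if __name__ == "__main__":
    easy = lambda: "A"                # toy "easy question": always the same answer
    flip = cycle(["A", "B"])
    hard = lambda: next(flip)         # toy "hard question": answers never settle
    print("easy:", adaptive_vote(easy))
    print("hard:", adaptive_vote(hard))
```

The easy question exits after the minimum number of samples, while the hard one uses the full budget. Agreement is only a proxy for correctness, so a rule like this inherits the weakness of voting: it will stop early on a confidently wrong answer too.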
The honest caveat: test-time compute is not free and not unlimited. Every extra sample or extra reasoning step costs money and time, the gains flatten out, and voting-based methods can entrench confident errors. The art is knowing which questions deserve the extra thought and having a way to tell a good answer from a bad one when you get there.
Self-Consistency Improves Chain of Thought Reasoning in Language Models (Wang et al., 2022)
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling (Brown et al., 2024)
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (Snell et al., 2024)
Key questions
What is test-time compute?
Why generate many answers instead of just one?
Does more test-time compute always help?
Cite this
APA
Ground Truth. (2026, July 4). Test-Time Compute: Spending More Thinking at the Moment You Ask. Ground Truth. https://groundtruth.day/learn/test-time-compute.html
BibTeX
@misc{groundtruth:test-time-compute,
title = {Test-Time Compute: Spending More Thinking at the Moment You Ask},
author = {{Ground Truth}},
year = {2026},
month = {jul},
url = {https://groundtruth.day/learn/test-time-compute.html}
}