Chain-of-thought: why making an AI think out loud makes it smarter

Ask a language model a tricky question and demand an instant answer, and it often gets it wrong. Ask the same model to think step by step first, and it frequently gets it right. That gap, from one phrase, is one of the most important discoveries in how to use these systems. It is called chain-of-thought, and understanding it explains a lot about why modern thinking models behave the way they do.

What it is

Chain-of-thought means having a model generate its intermediate reasoning, the steps between the question and the answer, instead of jumping straight to a conclusion. If you give a model a word problem about tennis balls and it first writes out "there are three cans, each can has three balls, that is nine" and only then states the answer, it is doing chain-of-thought. The landmark result, Wei et al. (2022), showed that simply prompting large models with worked examples that include these steps sharply improved their performance on arithmetic, commonsense, and symbolic reasoning problems, with no retraining at all. A companion finding, Kojima et al. (2022), showed you do not even need examples: just appending "Let's think step by step" to a prompt unlocks much of the benefit.
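
To make the difference concrete, here is a minimal Python sketch of the three prompt styles. The function names and the sample questions are illustrative assumptions, and no particular model API is implied: each helper only builds the text you would send to whatever model you use.

```python
# Hypothetical helpers: they only build prompt strings and call no model API.

ZERO_SHOT_TRIGGER = "Let's think step by step."  # phrase from Kojima et al. (2022)


def direct_prompt(question: str) -> str:
    """Ask for the answer with no room for intermediate reasoning."""
    return f"Q: {question}\nA:"


def zero_shot_cot_prompt(question: str) -> str:
    """Zero-shot chain-of-thought: append the trigger phrase, no examples."""
    return f"Q: {question}\nA: {ZERO_SHOT_TRIGGER}"


def few_shot_cot_prompt(question: str, examples: list[tuple[str, str]]) -> str:
    """Few-shot chain-of-thought: show worked examples (a question plus
    reasoning that ends in an answer) before the new question."""
    shots = "\n\n".join(f"Q: {q}\nA: {worked}" for q, worked in examples)
    return f"{shots}\n\nQ: {question}\nA:"


if __name__ == "__main__":
    example = (
        "There are three cans with three tennis balls each. How many balls?",
        "There are three cans. Each can has three balls. 3 * 3 = 9. The answer is 9.",
    )
    print(few_shot_cot_prompt("Two boxes hold four pens each. How many pens?", [example]))
```

The only difference between the direct prompt and the zero-shot version is the trigger phrase at the end, which is the point of the Kojima et al. finding.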

Why it works

There are two intertwined reasons, and the second one is genuinely surprising.

The first is decomposition. Hard problems have parts. A model that must produce the final answer in a single step has to do all the work invisibly, in one pass. By writing intermediate steps, it breaks a big leap into a chain of small, manageable ones, and each written step becomes context the model can lean on for the next. It is the difference between doing long division in your head and doing it on paper. The paper holds your place so you do not have to keep everything in working memory at once. For how that working memory is structured, see transformers and context windows.

The second reason is subtler and was sharpened by recent work from Google Research, which we covered in why thinking helps models remember. Every token a model generates takes another forward pass through the network. Generating a reasoning trace literally gives the model more compute steps before it has to commit to an answer. Astonishingly, the researchers found that even semantically empty filler, such as repeating "let me think", improves recall, because the extra tokens act as a computational buffer. The content of the thinking still matters, but part of the magic is simply giving the model room to compute. Think of it as the model muttering to itself: even the muttering helps, because the brain keeps working during the pause.
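
A back-of-the-envelope sketch of the "more tokens, more compute" idea, in Python. The token counts and the model size are made-up numbers, and the figure of roughly 2 FLOPs per parameter per generated token is a common rule of thumb that ignores attention cost, not something taken from the research described above.

```python
def forward_passes(reasoning_tokens: int, answer_tokens: int) -> int:
    """Autoregressive decoding runs one forward pass per generated token."""
    return reasoning_tokens + answer_tokens


def approx_decode_flops(n_params: float, generated_tokens: int) -> float:
    """Rough estimate: about 2 FLOPs per parameter per generated token.
    This is an approximation that ignores attention and prompt processing."""
    return 2.0 * n_params * generated_tokens


if __name__ == "__main__":
    n_params = 7e9  # hypothetical 7-billion-parameter model

    direct = forward_passes(reasoning_tokens=0, answer_tokens=3)
    with_trace = forward_passes(reasoning_tokens=200, answer_tokens=3)

    print("passes, direct answer:", direct)
    print("passes, with reasoning:", with_trace)
    print("approx FLOPs, direct:", approx_decode_flops(n_params, direct))
    print("approx FLOPs, with reasoning:", approx_decode_flops(n_params, with_trace))
```

Under these assumptions the reasoning trace buys the model dozens of times more computation before it commits, whatever the trace happens to say.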

Making it more reliable

A single chain of reasoning can go off the rails. One influential improvement, self-consistency (Wang et al., 2022), has the model sample several independent reasoning paths and then take the answer that most of them agree on, the way you might solve a problem three different ways and trust the answer you reached twice. This majority vote over multiple chains tends to beat a single chain, because wrong reasoning tends to be wrong in scattered, inconsistent ways while correct reasoning converges.
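
The voting step is simple enough to sketch in a few lines of Python. Here `sample_chain` is a hypothetical stand-in for "run the model once with sampling turned on and extract its final answer"; the demo replaces it with canned answers so the example runs without any model.

```python
from collections import Counter
from typing import Callable


def self_consistent_answer(sample_chain: Callable[[], str], n_samples: int = 5) -> str:
    """Sample several reasoning chains and return the most common final answer.

    `sample_chain` is a hypothetical callable: each call should run one
    independently sampled chain of thought and return only its final answer.
    """
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    answers = [sample_chain() for _ in range(n_samples)]
    # Majority vote over final answers; the reasoning text itself is discarded.
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    # Canned answers standing in for five sampled chains: three agree on "9".
    canned = iter(["9", "6", "9", "12", "9"])
    print(self_consistent_answer(lambda: next(canned), n_samples=5))  # prints 9
```

Note that the vote is over final answers only, so two chains that reason differently but land on the same number still count as agreeing.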

From a trick to a trained-in skill

Chain-of-thought began as a prompting trick, but it has since been baked into models directly. Today's reasoning models are trained, often with reinforcement learning, to produce long internal reasoning before answering. DeepSeek-R1 (2025) is a well-known example where the model learned, through reward, to think extensively on its own. This is why thinking models feel slower and more expensive: they are spending many tokens reasoning before they reply, and those tokens cost compute. It is also why the thinking budget, meaning how many reasoning tokens you allow, has become a real product dial.
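
A rough sketch of what that dial trades off, in Python. Everything here is an assumption for illustration: the names, the token counts, the price, and the premise that thinking tokens are billed at the same rate as answer tokens. Real products differ, so check the pricing of whatever service you use.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    thinking_tokens: int
    answer_tokens: int


def capped_usage(wanted_thinking: int, answer_tokens: int, thinking_budget: int) -> Usage:
    """Apply a thinking budget: reasoning stops once the cap is reached."""
    return Usage(min(wanted_thinking, thinking_budget), answer_tokens)


def cost_usd(usage: Usage, usd_per_million_tokens: float) -> float:
    """Assumes thinking and answer tokens are billed at one flat rate."""
    total = usage.thinking_tokens + usage.answer_tokens
    return total * usd_per_million_tokens / 1_000_000


if __name__ == "__main__":
    price = 10.0  # hypothetical USD per million generated tokens
    for budget in (0, 2_000, 8_000):
        usage = capped_usage(wanted_thinking=8_000, answer_tokens=200, thinking_budget=budget)
        print(f"budget={budget}: {usage.thinking_tokens} thinking tokens, "
              f"${cost_usd(usage, price):.4f}")
```

Turning the budget up lets the model reason longer on hard problems; turning it down makes replies faster and cheaper at the risk of cutting the reasoning short.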

Where it backfires

Chain-of-thought is not free magic. The same Google Research work flags the danger: if the model generates a wrong intermediate fact, that error primes the wrong knowledge and can amplify into a confidently wrong final answer, a failure that connects directly to hallucination. More thinking is better only when the thinking stays grounded; when it drifts, the model builds a tidy argument on a premise it invented a sentence ago.
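
A toy calculation shows why long chains are fragile. This Python sketch is a deliberately simplified model and is not taken from the research mentioned above: it assumes every step is right with the same probability, that steps are independent, and that one wrong step ruins the final answer. Real models can sometimes notice and recover from a mistake, so treat the numbers as intuition only.

```python
def chain_all_correct_probability(step_accuracy: float, n_steps: int) -> float:
    """Probability that every step in a chain is correct, under the toy
    assumption that steps are independent and equally reliable."""
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must be between 0 and 1")
    if n_steps < 0:
        raise ValueError("n_steps must be non-negative")
    return step_accuracy ** n_steps


if __name__ == "__main__":
    # Hypothetical per-step accuracy of 95 percent.
    for n in (1, 5, 10, 20):
        p = chain_all_correct_probability(0.95, n)
        print(f"{n:>2} steps: chance of a fully correct chain = {p:.2f}")
```

Even with steps that are individually very reliable, the chance of an error-free chain falls as the chain gets longer, which is one way to see why ungrounded extra thinking can hurt.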

There is also a trust trap worth naming: a model's stated reasoning is not guaranteed to be the actual reason for its answer. It can produce a plausible-looking chain that rationalizes a conclusion it reached by other means. So a convincing explanation is not proof the model reasoned correctly, only that it can write a convincing explanation. Useful, often illuminating, but not a window you should trust blindly.

Key papers
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al., 2022)
Large Language Models are Zero-Shot Reasoners (Kojima et al., 2022)
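Self-Consistency Improves Chain of Thought Reasoning in Language Models (Wang et al., 2022)
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (DeepSeek-AI, 2025)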