Learn · Intermediate

Speculative Decoding: How AI Types Faster Without Changing a Word

Language models are slow for one specific reason: they generate text one token at a time. A transformer produces the next token, feeds it back in, produces the token after that, and repeats - each step a full pass through a model that may hold hundreds of billions of parameters. The steps are expensive and strictly sequential: you cannot compute token ten until you have tokens one through nine. This is the core bottleneck of AI generation, and speculative decoding is one of the most elegant ways around it. It made the front page this week when DeepSeek's DSpark and the JetSpec project both pushed the idea further; see our news coverage.

The key insight: verifying is cheaper than generating

Here is the observation that makes the whole thing work. Generating tokens one by one is slow, but checking a batch of already-written tokens is fast - a transformer can score many positions in a single parallel pass. So what if something else did the slow guessing, and the big model only had to verify?

That is exactly the setup. You pair the big, smart, slow model (the target) with a small, cheap, fast model (the draft). The draft model runs ahead and proposes the next several tokens. Then the target model looks at all those proposed tokens at once, in one pass, and decides how many to accept. For predictable stretches of text - boilerplate, common phrasing, the obvious continuation of a sentence - the draft model guesses right, and the target confirms a whole chunk in a single step instead of grinding through it token by token. When the draft guesses wrong, the target catches the error at the first wrong token, substitutes its own choice for that position, throws away the rest of the draft, and the process restarts from there. Every verification pass therefore yields at least one token, the same as an ordinary decoding step.
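To make the loop concrete, here is a minimal Python sketch of the draft-then-verify cycle under greedy decoding. Everything in it is a stand-in: target_next and draft_next are hypothetical toy rules, not real models, and the list comprehension that builds checks plays the role of the single parallel pass a real transformer would run. Treat it as an illustration of the control flow, not a production decoder.

```python
from typing import List, Tuple


def target_next(ctx: List[int]) -> int:
    """Hypothetical 'big' model: a fixed rule standing in for the target's greedy choice."""
    return (3 * ctx[-1] + len(ctx)) % 11


def draft_next(ctx: List[int]) -> int:
    """Hypothetical 'small' model: agrees with the target except when len(ctx) % 4 == 0."""
    guess = target_next(ctx)
    return (guess + 1) % 11 if len(ctx) % 4 == 0 else guess


def plain_decode(prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out


def speculative_decode(prompt: List[int], n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding. Returns (tokens, number of verification passes)."""
    out = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(out) < goal:
        # 1. The draft runs ahead and proposes k tokens, one cheap step at a time.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft_next(out + proposal))

        # 2. The target scores all k + 1 positions. A real transformer would do
        #    this in one parallel pass; the toy version just loops.
        checks = [target_next(out + proposal[:i]) for i in range(k + 1)]
        passes += 1

        # 3. Accept the longest prefix on which draft and target agree.
        n_ok = 0
        while n_ok < k and proposal[n_ok] == checks[n_ok]:
            n_ok += 1

        # 4. Keep the accepted tokens, then add the target's own token for the
        #    next position (a correction after a miss, a bonus after a clean run).
        out.extend(proposal[:n_ok])
        out.append(checks[n_ok])
    return out[:goal], passes


if __name__ == "__main__":
    tokens, passes = speculative_decode([1, 2], 20)
    print(tokens == plain_decode([1, 2], 20), passes)
```

In this toy setup the speculative loop returns the same tokens as plain_decode while using fewer verification passes than generated tokens. Comparing the two outputs this way is also a simple sanity check for any real implementation.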

Why the output doesn't change

The beautiful part is that this is lossless in a precise sense: the output follows exactly the distribution the target model would have produced on its own, and under greedy decoding it is token-for-token the same text. The original speculative decoding and speculative sampling papers proved this with a clever accept/reject rule. The target doesn't blindly trust the draft; it compares the draft's probability q for each guessed token against its own probability p, accepts the token with probability min(1, p/q), and on a rejection resamples that position from the leftover distribution proportional to max(0, p - q). Together these steps mathematically guarantee that the result matches the target's true distribution. The draft model can only change the speed, never the answer. A weak draft just means fewer guesses get accepted, so you fall back toward normal one-at-a-time speed (or slightly below it, once the draft's own cost is counted) - you never get worse output, only less speedup.
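The accept/reject rule from those papers is short enough to write out. The sketch below assumes both models expose full next-token probability lists over the same vocabulary (p for the target, q for the draft). The function names and the three-token example distributions are made up for illustration.

```python
import random
from typing import List, Sequence


def speculative_token(p: Sequence[float], q: Sequence[float],
                      rng: random.Random) -> int:
    """Emit one token distributed according to p, starting from a guess drawn from q."""
    vocab = range(len(p))
    x = rng.choices(vocab, weights=q, k=1)[0]      # the draft's guess
    # Accept the guess with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    # Rejected: resample from the residual max(0, p - q); rng.choices
    # normalizes the weights for us.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(vocab, weights=residual, k=1)[0]


def exact_output_distribution(p: Sequence[float], q: Sequence[float]) -> List[float]:
    """Work out analytically what speculative_token emits; it should equal p."""
    # Probability of proposing token i and accepting it: q_i * min(1, p_i / q_i) = min(p_i, q_i).
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]
    rejected_mass = 1.0 - sum(accepted)
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    if total == 0.0:                               # p == q: nothing is ever rejected
        return accepted
    return [a + rejected_mass * r / total for a, r in zip(accepted, residual)]


if __name__ == "__main__":
    p = [0.6, 0.3, 0.1]   # target distribution (illustrative)
    q = [0.2, 0.5, 0.3]   # draft distribution (illustrative, deliberately poor)
    print(exact_output_distribution(p, q))

    rng = random.Random(0)
    n = 20000
    counts = [0, 0, 0]
    for _ in range(n):
        counts[speculative_token(p, q, rng)] += 1
    print([c / n for c in counts])
```

Even though the draft here disagrees sharply with the target, both the analytic result and the sampled frequencies land on the target's distribution. The mismatch shows up only as a lower acceptance rate.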

An analogy

Think of a meticulous senior editor who must approve every sentence of a manuscript. Alone, they write and approve one sentence at a time - thorough but slow. Now hire a fast junior writer to draft the next few sentences on a guess. The editor reads all of them in a single glance: sentences that match what they would have written get an instant checkmark, and at the first sentence that's wrong, the editor stops, fixes it, and the junior restarts from there. On routine passages the junior nails it and the pair flies. On hard passages the editor takes back over. The final manuscript is exactly what the editor would have written alone - it just got finished faster.

Variations worth knowing

The basic recipe has spawned a family of tricks. Tree drafting (the heart of this week's DSpark and JetSpec) has the draft propose not a single line of tokens but a whole tree of plausible continuations, so the target is more likely to find a branch it agrees with and can accept more tokens per pass. Self-drafting methods like Medusa skip the separate draft model entirely - they bolt extra lightweight prediction heads onto the big model so it drafts its own candidates, avoiding the hassle of training and serving two models. Others use a quantized or early-exit version of the model itself as the drafter.
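As a rough picture of why a tree helps, the sketch below verifies a draft tree under greedy decoding. It is not the DSpark or JetSpec algorithm. The nested-dict tree format, the toy target_next rule, and the function name are all assumptions made for illustration, and a real system would typically score every node of the tree in one batched pass instead of calling the target step by step.

```python
from typing import Dict, List

# A draft tree maps each candidate token to the subtree of continuations after it.
Tree = Dict[int, "Tree"]


def target_next(ctx: List[int]) -> int:
    """Hypothetical target model: a fixed rule standing in for its greedy choice."""
    return (3 * ctx[-1] + len(ctx)) % 11


def longest_accepted_path(ctx: List[int], tree: Tree) -> List[int]:
    """Follow the branch that matches the target's choice at each depth."""
    accepted: List[int] = []
    node = tree
    while node:                        # stop when we run out of drafted children
        want = target_next(ctx + accepted)
        if want not in node:           # no drafted branch matches: stop here
            break
        accepted.append(want)
        node = node[want]
    return accepted


if __name__ == "__main__":
    prompt = [1, 2]
    # A single-line draft is just a tree with one child per node.
    chain: Tree = {7: {4: {}}}
    # A tree draft hedges with alternatives at each depth.
    tree: Tree = {7: {4: {}}, 8: {5: {8: {}}, 6: {}}}
    print(longest_accepted_path(prompt, chain))   # first guess misses: nothing accepted
    print(longest_accepted_path(prompt, tree))    # the second branch matches three deep
```

The single-line draft loses everything when its first guess is wrong, while the tree still has a branch the target agrees with. The price is that the draft must produce, and the target must score, more candidate tokens per pass.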

Why it matters

Speculative decoding is pure operational leverage. In its basic form it needs no retraining of the main model and changes none of its outputs, yet the papers listed below report roughly two to three times faster generation, and some variants report more. In a world where inference, not training, is where most AI money is actually spent - every chatbot reply, every agent step, every API call - cutting generation time directly cuts cost and latency. That's why a decoding-systems paper can outrank a flashy model launch among practitioners.

The honest caveats

The speedup is real but not a fixed number. How much you gain depends on the workload: highly predictable text gets large speedups, genuinely creative or surprising text gets less, because the draft model guesses wrong more often. The draft and target also have to be well matched - a draft that's too weak wastes effort on rejected guesses, and one that's too large is itself slow. And while the math guarantees an identical output distribution in theory, practical implementations can introduce subtle bugs, which is why careful teams audit their speculative decoders against plain generation. Used well, though, it's close to a free lunch - and free lunches are rare enough in AI that this one is everywhere. To go deeper on the cost dynamics it exploits, read training vs. inference; to see why the underlying one-token-at-a-time loop exists, start with transformers.
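To get a feel for how acceptance rate and draft cost trade off, here is a small calculator based on the simplified analysis in Leviathan et al. It assumes each drafted token is accepted independently with the same probability alpha, that the draft proposes gamma tokens per round, and that one draft step costs a fraction c of one target step. Real workloads violate these assumptions, so read the numbers as a guide, not a prediction.

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens produced per target pass: (1 - alpha^(gamma+1)) / (1 - alpha)."""
    if alpha >= 1.0:
        return float(gamma + 1)        # every guess accepted, plus the bonus token
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, gamma: int, c: float) -> float:
    """Expected wall-time speedup over plain decoding under the simplified model.

    One round costs gamma draft steps (gamma * c target-equivalents) plus one
    target pass, and yields expected_tokens_per_pass(alpha, gamma) tokens.
    """
    return expected_tokens_per_pass(alpha, gamma) / (gamma * c + 1.0)


if __name__ == "__main__":
    # Illustrative settings only; none of these numbers come from a benchmark.
    for alpha, c in [(0.8, 0.05), (0.5, 0.05), (0.1, 0.2)]:
        print(alpha, c, round(expected_speedup(alpha, gamma=4, c=c), 2))
```

With alpha = 0.8, gamma = 4 and c = 0.05 the model predicts a speedup of about 2.8x. With a weak and relatively costly draft (alpha = 0.1, c = 0.2) it drops below 1, meaning speculation would be slower than plain decoding.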

Key papers
Fast Inference from Transformers via Speculative Decoding (Leviathan et al., 2022)
Accelerating Large Language Model Decoding with Speculative Sampling (Chen et al., 2023)
Medusa: Simple LLM Inference Acceleration with Multiple Decoding Heads (Cai et al., 2024)