News · 2026-06-27

The trick that makes AI type faster just hit the top of Hacker News

Large language models are slow in a specific, frustrating way: they write one word at a time (strictly, one token, which is a word or a piece of one). To produce the next token, the model runs a full, heavy computation; then it does it again for the token after that, and again, all the way to the end of the answer. Each step is expensive, and the steps happen strictly one after another. This is why a long response takes the time it does, and why running these models at scale costs real money. This week, a normally quiet corner of AI engineering - how to make that one-word-at-a-time loop faster - jumped to the top of Hacker News, where DeepSeek's new DSpark project pulled in more than seven hundred points, alongside a related effort called JetSpec. Both are chasing the same prize: make the model type faster without changing a single word it produces.

The trick is called speculative decoding, and it's elegant. You pair the big, slow, smart model with a small, fast, cheaper one - the draft model. The little model races ahead and guesses the next several words. Then the big model checks all those guesses at the same time, in a single pass, instead of generating them one by one. When the guesses are right - and for easy, predictable stretches of text they usually are - the big model confirms a whole chunk at once and skips the slow step-by-step grind. When a guess is wrong, the big model catches it and corrects course. The output is exactly what the big model would have written alone, just produced in fewer slow rounds. We wrote a full plain-language explainer on how speculative decoding works.

Here's the analogy. Picture a careful senior editor who has to approve every sentence of a document. Working alone, they write and approve one sentence at a time - thorough but slow. Now give them a fast junior writer who drafts the next few sentences on a guess. The editor reads all of them in one glance: the ones that match what they would have written get a checkmark instantly, and the moment one is wrong, the editor stops, fixes it, and the junior starts fresh from there. On routine passages the junior nails it and the pair flies; on tricky passages the editor takes back over. Same final document, far less waiting.
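For readers who think in code, the draft-then-verify loop can be sketched in a few lines of Python. This is a toy under loud assumptions, not DSpark's or JetSpec's implementation: the "models" are plain functions that return one greedy next word, the names `speculative_decode`, `target`, `draft`, and `plain_decode` are invented for illustration, and the verification loop stands in for what a real system does in a single batched pass.

```python
from typing import Callable, List, Tuple

# Assumption: a "model" is any function mapping the words so far to its next word.
NextToken = Callable[[List[str]], str]

def plain_decode(target: NextToken, prompt: List[str], max_new: int) -> List[str]:
    """Baseline: one slow target pass per new word."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out

def speculative_decode(target: NextToken, draft: NextToken, prompt: List[str],
                       max_new: int, k: int = 4) -> Tuple[List[str], int]:
    """Greedy speculative decoding. Returns (words, number of target passes)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < max_new:
        # 1. The cheap draft model guesses k words ahead.
        guesses: List[str] = []
        for _ in range(k):
            guesses.append(draft(out + guesses))
        # 2. The target checks every guess. A real system scores all positions
        #    in one batched pass; we loop for clarity and count it as one pass.
        passes += 1
        accepted: List[str] = []
        for g in guesses:
            want = target(out + accepted)
            accepted.append(want)      # always emit the target's own word
            if want != g:
                break                  # first wrong guess: keep the fix, drop the rest
        else:
            # Every guess matched, so the same pass also yields one extra word.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[:len(prompt) + max_new], passes

# Toy models: the target "knows" a sentence; the draft is wrong at some positions.
TEXT = "the cat sat on the mat and then it slept".split()

def target(seq: List[str]) -> str:
    return TEXT[len(seq)] if len(seq) < len(TEXT) else "<eos>"

def draft(seq: List[str]) -> str:
    return "???" if len(seq) % 4 == 3 else target(seq)

plain_out = plain_decode(target, ["the"], 9)
spec_out, passes = speculative_decode(target, draft, ["the"], 9)
```

In this toy run both decoders produce the same ten words, but the speculative version needs 3 passes of the big model instead of 9. Note that the emitted word always comes from `target`; the draft only decides how many words each pass gets to confirm.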

What DSpark and JetSpec add is going wide instead of straight. Classic speculative decoding has the draft model guess a single line of words. The newer approach drafts a whole tree of possible continuations - several plausible next paths at once - so when the big model verifies, it's more likely to find a branch it agrees with and can accept even more words per pass. JetSpec's project page and the Hao AI Lab writeup cite speedups of up to eight times, with the underlying paper reporting a range from roughly two to eight times depending on the model and the task. The corresponding research is posted as a paper on parallel tree drafting.
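The difference between a single line of guesses and a tree of guesses can also be sketched in Python. This is a minimal illustration under assumptions, not the algorithm from the paper: the draft tree is a hand-written nested dictionary, `verify_tree` and `target` are hypothetical names, and verification is greedy. A real system would score the whole tree in one pass of the big model rather than walking it step by step.

```python
from typing import Callable, Dict, List

# Assumption: a draft tree maps each guessed word to the subtree of words
# that the draft thinks may follow it. An empty dict is a leaf.
Tree = Dict[str, "Tree"]

def verify_tree(target: Callable[[List[str]], str],
                prefix: List[str], tree: Tree) -> List[str]:
    """Follow the target's own greedy choices down the draft tree."""
    accepted: List[str] = []
    node = tree
    while True:
        want = target(prefix + accepted)
        accepted.append(want)          # the target's word is always emitted
        if want not in node:
            break                      # no branch guessed it: stop here
        node = node[want]              # a branch matched: keep descending
    return accepted

TEXT = "the cat sat on the mat".split()

def target(seq: List[str]) -> str:
    return TEXT[len(seq)] if len(seq) < len(TEXT) else "<eos>"

# A single line of guesses that starts wrong...
chain: Tree = {"dog": {"sat": {"on": {}}}}
# ...versus a tree that hedges across several first words.
wide: Tree = {"dog": {"sat": {}}, "cat": {"sat": {"on": {}}, "ran": {}}}

chain_accepted = verify_tree(target, ["the"], chain)
wide_accepted = verify_tree(target, ["the"], wide)
```

Here the single chain bets on "dog", misses, and yields only the big model's one corrected word. The tree also holds a "cat" branch, so the same verification step confirms four words. That is the intuition behind drafting wide: more branches, better odds that one of them matches.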

Why this matters is money and latency, the two things every company running AI actually feels. Even a two-to-four-times speedup, the lower half of the reported range, with no retraining and no quality loss isn't an academic curiosity; it's a smaller cloud bill and a snappier product, which is precisely why systems-level inference work, usually invisible to the public, briefly outranked flashy model launches on a hacker forum.

But the honest caveat is the part the community is busy stress-testing, and it's worth taking seriously. The headline word is "lossless" - the promise that output is identical to running the big model alone. In theory that's true: the big model only ever emits a word it would have chosen anyway, so the guessing can't change the answer, only the speed. In practice, on the popular forum for running models locally, people report the picture is nuanced. The speedups are real, but the size of the speedup swings hard with the workload, and some users report minor quality wobble on complex tasks when the little draft model is too weak to guess well. The resolution is subtle: "lossless" describes the decoding rule, and under that rule a weak draft costs speed, not correctness, because every wrong guess is thrown away and replaced by the big model's own word. Any quality difference therefore has to come from somewhere other than the rule itself, for example an implementation that loosens verification to accept more guesses. The magnitude of the gain, meanwhile, is never a fixed eight times - it depends on the model pair, the batch size, and how predictable the text is. Treat "up to 8x" the way you'd treat any "up to" number: real, achievable in the best case, and not a promise for your case. For the bigger picture on why inference cost dominates AI economics, see training versus inference.
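To see why the gain swings so hard, a back-of-the-envelope model helps. The sketch below uses the standard textbook simplification for single-line drafting, not a figure from DSpark or JetSpec: it assumes every guessed word is accepted independently with the same probability `alpha`, and it ignores the time the draft model itself takes. The function name `expected_tokens_per_pass` is made up for illustration.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected words confirmed per big-model pass.

    alpha: assumed chance that any one guessed word is accepted (0 to 1).
    k:     number of words the draft guesses ahead.
    Sums 1 + alpha + alpha^2 + ... + alpha^k: the pass always yields one
    word, plus one more for each guess accepted before the first miss.
    """
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# Predictable text versus tricky text, same draft length.
easy = expected_tokens_per_pass(0.9, 4)
hard = expected_tokens_per_pass(0.5, 4)
```

Under these assumptions, a draft that is right half the time confirms just under two words per pass, while one that is right nine times in ten confirms about four. The ceiling is `k + 1` words per pass, reached only when every guess lands, and real-world gains come in lower once the draft model's own running time is counted.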

Primary source, verified: read the paper → (arXiv 2606.18394)