News · 2026-06-30
Ollama nearly doubles Gemma's speed on Macs by guessing ahead
Ollama shipped an update this week that nearly doubles the speed of Google's Gemma model on Apple computers, using a technique called multi-token prediction. The approach preserves model output exactly — identical answers, delivered faster — with coding benchmarks showing nearly 90% faster generation. The company detailed the work in a technical blog post.
Key facts
- What: Ollama, a free local-AI tool, now runs Google's Gemma model nearly twice as fast on Apple computers, using a trick where a small model drafts words and the big one checks them in bulk.
- When: 2026-06-30
- Primary source: Ollama's technical blog post
Multi-token prediction is a close cousin of speculative decoding, which we cover in a full lesson. The core insight: generating text one word at a time wastes the machine's muscle. Each step, the big model does a huge amount of computation just to produce a single next word. But much of what comes next is easy to guess — after an opening bracket comes a closing one, after a common phrase comes its usual ending. Instead of making the big, expensive model do every step alone, Ollama puts a small, fast model beside it. The little one races ahead and drafts several likely next words. Then the big model checks that whole batch in a single pass, keeping the guesses it agrees with and discarding the rest. Because verifying several words at once costs barely more than producing one, every correct guess is essentially free speed.
Think of it like a senior editor and a fast junior writer. The junior scribbles the next few words they expect; the editor glances at the batch and, wherever the junior nailed it, waves it through without rewriting. The final text is exactly what the editor would have written alone — the editor still approves every word — but it gets produced far faster because the easy parts were pre-filled. That is why Ollama can promise a big speedup without changing the model's output at all: the answers are identical, they just arrive quicker.
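To make the draft-and-check loop concrete, here is a minimal Python sketch. It is a toy, not Ollama's implementation: `target` and `draft` are hypothetical stand-ins for the big and small models, both pick their next token deterministically, and the single batched verification pass is simulated with ordinary function calls. What it does illustrate is the property described above: the speculative version emits exactly the tokens the big model would have produced alone, while needing fewer big-model passes.

```python
from typing import Callable, List, Tuple

Token = str
NextToken = Callable[[List[Token]], Token]

# Hypothetical "ground truth" the toy big model reproduces, position by position.
TEXT = "for ( i = 0 ; i < n ; i ++ ) { sum += x [ i ] ; }".split()


def target(ctx: List[Token]) -> Token:
    """Toy big model: always returns the 'right' next token."""
    return TEXT[len(ctx) % len(TEXT)]


def draft(ctx: List[Token]) -> Token:
    """Toy small model: agrees with the big one except at every 5th position."""
    return "?" if len(ctx) % 5 == 0 else TEXT[len(ctx) % len(TEXT)]


def generate_plain(model: NextToken, prompt: List[Token], n: int) -> List[Token]:
    """Baseline: one big-model pass per generated token."""
    out = list(prompt)
    for _ in range(n):
        out.append(model(out))
    return out[len(prompt):]


def generate_speculative(
    big: NextToken, small: NextToken, prompt: List[Token], n: int, k: int = 4
) -> Tuple[List[Token], int]:
    """Draft k tokens with the small model, then check them with the big one.

    Returns the generated tokens and the number of verification passes.
    """
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < n:
        # 1. The small model races ahead and drafts k guesses.
        ctx = list(out)
        guesses = []
        for _ in range(k):
            guess = small(ctx)
            guesses.append(guess)
            ctx.append(guess)

        # 2. The big model checks the batch. A real system scores all k
        #    positions in one forward pass; here that is simulated with calls.
        passes += 1
        for guess in guesses:
            expected = big(out)
            out.append(expected)  # the big model's token is always what is kept
            if expected != guess:
                break  # first wrong guess: discard the rest of the draft
        else:
            # Every guess was accepted, so the same pass yields one extra token.
            out.append(big(out))
    return out[len(prompt):len(prompt) + n], passes


plain = generate_plain(target, [], 20)
spec, passes = generate_speculative(target, draft, [], 20, k=4)
print("identical output:", spec == plain)
print("big-model passes:", passes, "instead of", len(plain))
```

Because every emitted token is the big model's own choice, a bad draft can only waste work. It cannot change the text.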
Code benefits the most, and that is not a coincidence. Programming is unusually predictable — it is full of closing brackets, standard boilerplate, and repeated structure — so the little draft model guesses right a large fraction of the time, and each correct guess pays off. On coding benchmarks, Ollama reports nearly ninety percent faster generation, with one measurement jumping from about fifty words a second to ninety-five. For someone using a local model as a coding assistant on a Mac, that is the difference between a tool that feels sluggish and one that feels responsive.
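A rough back-of-the-envelope model shows why predictability matters so much. The sketch below is an illustration under simplifying assumptions, not Ollama's measurement method: it assumes each drafted word is accepted independently with the same probability, and it ignores the time the draft model itself takes, so the result is an upper bound on the gain. The function names are made up for this example.

```python
def percent_faster(before_per_sec: float, after_per_sec: float) -> float:
    """How much faster 'after' is than 'before', as a percentage."""
    return 100.0 * (after_per_sec - before_per_sec) / before_per_sec


def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per big-model pass.

    Assumes each drafted token is accepted independently with probability
    accept_rate. The pass always yields at least one token (the big model's
    own), plus one more for each guess accepted before the first miss:
    1 + a + a^2 + ... + a^draft_len.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    return sum(accept_rate ** i for i in range(draft_len + 1))


# The article's figures: about fifty words a second rising to ninety-five.
print("percent faster:", percent_faster(50, 95))

# Predictable text (high acceptance) versus free-form prose (low acceptance).
for rate in (0.3, 0.6, 0.9):
    print(rate, round(expected_tokens_per_pass(rate, draft_len=4), 2))
```

Under these assumptions, the payoff per pass climbs steeply as the acceptance rate approaches one, which is the regime repetitive code tends to sit in.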
The engineering details are where the real work went. Ollama tuned the system so it automatically adjusts how many words the draft model guesses at a time, based on how often its guesses are being accepted and how long verification takes — so when the draft model stops being useful, the system quietly falls back to normal generation rather than wasting effort. The whole loop — drafting, sampling, and verifying — runs on the computer's graphics chip in a single pass, avoiding the slow handoffs between different parts of the machine that often bottleneck local AI. And notably, Ollama contributed a specialized piece of code to Apple's underlying MLX framework to make the batch-checking step efficient. That kernel handles small batches of a handful of words at a time and reuses the model's weights across them, and because it lives in the shared framework, other models running on Apple hardware can benefit too — a rare case of one tool's optimization lifting the whole local-AI ecosystem.
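The adaptive part can be pictured with a small controller like the one below. This is a hypothetical heuristic written for illustration, since the article does not describe Ollama's actual rule: the class name, the smoothing factor, and the linear cost model (a fixed verification cost plus a small cost per checked word) are all assumptions. It keeps a running estimate of how often guesses are accepted, then picks the draft length with the best expected words per millisecond, returning zero when plain generation wins.

```python
from dataclasses import dataclass


def expected_tokens(accept_rate: float, draft_len: int) -> float:
    """Expected tokens per verification pass, assuming independent acceptance."""
    return sum(accept_rate ** i for i in range(draft_len + 1))


@dataclass
class DraftLengthController:
    """Hypothetical controller; not taken from Ollama's code."""

    max_draft_len: int = 8
    smoothing: float = 0.2   # weight given to the newest observation
    accept_rate: float = 0.5  # running estimate, updated by record()

    def record(self, proposed: int, accepted: int) -> None:
        """Fold one round's outcome into the running acceptance estimate.

        accepted / proposed is a crude estimate, because guesses after the
        first miss are never actually checked.
        """
        if proposed <= 0:
            return
        observed = accepted / proposed
        self.accept_rate += self.smoothing * (observed - self.accept_rate)

    def choose(
        self, draft_ms: float, verify_base_ms: float, verify_per_token_ms: float
    ) -> int:
        """Return the draft length with the best expected tokens per ms.

        A result of 0 means: skip drafting and generate normally.
        """
        best_len = 0
        best_rate = 1.0 / verify_base_ms  # plain generation: one token per pass
        for k in range(1, self.max_draft_len + 1):
            cost_ms = k * draft_ms + verify_base_ms + k * verify_per_token_ms
            rate = expected_tokens(self.accept_rate, k) / cost_ms
            if rate > best_rate:  # strict: ties go to the shorter draft
                best_len, best_rate = k, rate
        return best_len


# Illustrative timings only (milliseconds); these are not measured values.
confident = DraftLengthController(accept_rate=0.9)
struggling = DraftLengthController(accept_rate=0.05)
print("draft length when guesses land:", confident.choose(1.0, 20.0, 0.5))
print("draft length when guesses miss:", struggling.choose(1.0, 20.0, 0.5))
```

The design choice worth noticing is that falling back is not a special case: a draft length of zero is simply one of the options being compared, so the system stops drafting whenever drafting stops paying for itself.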
The honest caveat: the headline number is a best case. Nearly ninety percent faster is a coding-benchmark figure, and coding is about the friendliest task there is for this method because of how predictable it is. On free-form prose, where the next word is genuinely hard to guess, the draft model is wrong more often, its guesses get thrown away, and the speedup shrinks, sometimes a lot. The technique is built not to make generation slower, because it falls back to normal generation when drafting stops paying off, but the near-doubling is specific to structured, repetitive text. Still, for the growing crowd running capable models on their own machines (see our lesson on open-weight models for why that crowd is growing), a large, free, output-preserving speedup on the most common local task is exactly the kind of unglamorous improvement that makes local AI more usable day to day.