News · 2026-06-22
Two labs race to make AI write whole paragraphs at once instead of word by word
Almost every AI you've used writes the way you might text with one thumb: one word, then the next, then the next, each one waiting on the one before it. That left-to-right, one-token-at-a-time habit is the single biggest reason long AI responses feel slow. A different approach is now having a real moment, and this week it turned into a two-horse race. The approach is called diffusion, and instead of writing in sequence it drafts a whole block of text at once as a rough, garbled mess and then repeatedly cleans it up until it reads correctly -- a bit like a photo coming into focus all over at the same time, rather than being painted in from one corner.
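To make the contrast concrete, here is a toy sketch of the two writing styles described above. It is an illustration under loud assumptions, not how DiffusionGemma or Mercury 2 actually work: there is no real model here (the "prediction" is simply copied from a known target sentence), the "garbled mess" is modeled as a row of hypothetical `<mask>` placeholders, and the function names `autoregressive` and `diffusion_like` are invented for this example. The only thing it shows is the difference in how many sequential passes each style needs.

```python
import random

TARGET = "the quick brown fox jumps over the lazy dog".split()
MASK = "<mask>"  # hypothetical placeholder standing in for "garbled" text


def autoregressive(target):
    """Write left to right: one token per sequential step."""
    out, steps = [], 0
    for tok in target:
        out.append(tok)  # stand-in for one model call that emits one token
        steps += 1
    return out, steps


def diffusion_like(target, num_steps=3, seed=0):
    """Start with a fully masked draft and clean it up over a few passes.

    Each pass fills in a share of the remaining positions all at once,
    in no particular left-to-right order.
    """
    rng = random.Random(seed)
    draft = [MASK] * len(target)
    masked = list(range(len(target)))
    rng.shuffle(masked)  # positions get resolved in arbitrary order
    steps = 0
    for step in range(num_steps):
        remaining_steps = num_steps - step
        k = -(-len(masked) // remaining_steps)  # ceiling division
        reveal, masked = masked[:k], masked[k:]
        for i in reveal:
            draft[i] = target[i]  # stand-in for the model's parallel guess
        steps += 1
    return draft, steps


if __name__ == "__main__":
    ar_text, ar_steps = autoregressive(TARGET)
    df_text, df_steps = diffusion_like(TARGET, num_steps=3)
    print("left to right:", ar_steps, "sequential steps")
    print("all at once:  ", df_steps, "sequential steps")
```

In this toy the nine-word sentence takes nine sequential steps written left to right and three when refined in parallel. Real systems do far more work per pass, so the step count alone says nothing about actual speed or quality.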
The open-weight contender is Google's DiffusionGemma (model card), released under a permissive license so anyone can download and run it. Its calling card is speed: because it polishes text in parallel rather than one word at a time, it can produce output far faster than a conventional model of similar size. What's notable is how hungry people are for it -- it climbed near the top of the download charts within days even though, unusually, no big cloud company is yet offering it as a ready-to-use hosted service. That gap created a scramble of its own: the urgent question in the community became 'how do I run this myself,' and tooling sprang up to answer it, including fine-tuning support from Unsloth and a community-built local interface (diffusiongemma-lab).
The challenger comes from Inception Labs. Its Mercury 2 (inceptionlabs.ai) is a diffusion text model offered only as a hosted service, and the company claims it is faster still. So you have a clean contest: an open model you can own but have to set up, versus a closed one you can't inspect but can call instantly -- both betting that parallel generation is the future of fast text. We've covered this paradigm before, in the story of a bigger text model that doesn't write left to right, and the underlying idea is laid out in our explainer on diffusion language models.
Why does writing-all-at-once matter? Because speed isn't a luxury -- it changes what's economically possible. A model that can generate a long document or a big chunk of code in a fraction of the time costs a fraction as much to run at scale, and feels qualitatively different to use: less waiting, more conversation. If diffusion text models keep their quality while running this fast, they could reshape the economics of anything that involves generating a lot of text -- summaries, code, drafts, translations -- and put real pressure on the one-word-at-a-time approach that has dominated since chatbots began.
A fair way to picture the trade-off: the traditional method is like a careful writer who commits to each word before moving on to the next -- reliable, but you watch every word appear. The diffusion method is like a sculptor starting with a rough block and working the whole shape into form at once -- potentially much faster, but you're trusting the cleanup process to land in the right place. Both can produce beautiful results; they fail in different ways.
The honest caveat is that speed is the easy part to demonstrate and quality is the hard part to prove. Generating text in parallel makes it trickier for the model to keep a long argument perfectly consistent, since it's not building strictly on what came just before. Researchers are still scrutinizing how these models hold up on long, reasoning-heavy tasks compared to the conventional kind, and asking harder questions about how interpretable they are (How transparent is DiffusionGemma, and why it matters). The speed claims, especially the 'we're faster than them' kind traded between two competitors, deserve independent testing before anyone treats them as settled. What's not in doubt is that parallel text generation has gone from a research curiosity to a real race, with one strong open option and one strong closed one pushing each other.