News · 2026-06-25

A language model that doesn't write left to right

Almost every AI model you have used writes the way you might expect: one word at a time, left to right, each word chosen based on everything before it. That approach is called autoregression, and it has been so dominant that it can feel like the only way to do it. A new model called iLLaDA, described in a paper on arXiv with code and weights on GitHub, is a reminder that it isn't.

iLLaDA is a diffusion language model. The idea is borrowed from the AI image generators that took over the internet. Those tools start with pure visual static and repeatedly clean it up until a picture emerges. A diffusion language model does the same thing with text: instead of placing words one by one, it starts with a sentence that is mostly blanked out and fills in the gaps over several passes, refining the whole thing at once until coherent text appears. If you want the full background, we have a lesson on diffusion language models, and we have covered this line of work before in "Text that arrives all at once."
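To make the fill-in-the-blanks idea concrete, here is a minimal Python sketch. It is a toy under stated assumptions, not iLLaDA's actual sampler: the names `toy_predict`, `diffusion_decode`, and `tokens_per_pass` are ours, and the "model" is a hard-coded table of guesses with made-up confidence scores. What it shows is the shape of the loop: start fully blanked out, guess every blank at once, keep only the most confident guesses, and repeat.

```python
MASK = "[MASK]"

# Hypothetical stand-in for a trained network. A real model would compute
# these guesses from the current partially masked sequence; here the
# (token, confidence) pairs are hard-coded purely for illustration.
TOY_PREDICTIONS = [
    ("the", 0.90), ("cat", 0.60), ("sat", 0.70),
    ("on", 0.95), ("the", 0.80), ("mat", 0.50),
]


def toy_predict(sequence):
    """Return a (token, confidence) guess for every position at once."""
    return list(TOY_PREDICTIONS)


def diffusion_decode(length, tokens_per_pass):
    """Fill a fully masked sequence over several passes.

    Each pass guesses all blanks in parallel, then commits only the
    `tokens_per_pass` most confident guesses and leaves the rest masked.
    Returns the sequence after every pass so the refinement is visible.
    """
    if tokens_per_pass < 1:
        raise ValueError("tokens_per_pass must be at least 1")
    sequence = [MASK] * length
    history = [list(sequence)]
    while MASK in sequence:
        guesses = toy_predict(sequence)
        masked = [i for i, tok in enumerate(sequence) if tok == MASK]
        # Most confident blanks first.
        masked.sort(key=lambda i: guesses[i][1], reverse=True)
        for i in masked[:tokens_per_pass]:
            sequence[i] = guesses[i][0]
        history.append(list(sequence))
    return history


history = diffusion_decode(length=6, tokens_per_pass=2)
for step, seq in enumerate(history):
    print(step, " ".join(seq))
```

Note that the words do not arrive in reading order: in this toy run "on" and the first "the" are committed first because they carry the highest confidence, and the sentence is complete after three passes rather than six word-by-word steps.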

The practical appeal is twofold. Because a diffusion model works on the whole sentence simultaneously rather than waiting for each previous word, it can in principle generate in parallel, which opens a door to faster output. And because it isn't locked into strict left-to-right order, it is naturally good at filling in a blank in the middle of existing text, the way you might edit a document, rather than only ever adding to the end. Editing and revision come more naturally than they do to a model that can only ever look backward.
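The fill-in-the-middle point can be illustrated with a second small Python sketch. Again, this is a toy and not how iLLaDA is implemented: `TOY_FILL` and `infill` are hypothetical names, and the lookup table stands in for a trained model. The point is only that each blank is guessed from the words on both sides of it, and that all blanks are filled in the same pass while the existing words stay untouched.

```python
MASK = "[MASK]"

# Hypothetical lookup standing in for a trained model: it guesses a blank
# from the word on its left AND the word on its right.
TOY_FILL = {
    ("the", "sat"): "cat",
    ("on", "mat"): "the",
}


def infill(sequence):
    """Fill every blank in one pass, using context on both sides."""
    result = list(sequence)
    for i, tok in enumerate(sequence):
        if tok != MASK:
            continue  # existing words are kept exactly as written
        left = sequence[i - 1] if i > 0 else None
        right = sequence[i + 1] if i + 1 < len(sequence) else None
        # Leave the blank in place if the toy table has no guess for it.
        result[i] = TOY_FILL.get((left, right), MASK)
    return result


draft = ["the", MASK, "sat", "on", MASK, "mat"]
filled = infill(draft)
print(" ".join(filled))
```

A strictly left-to-right model has no natural way to use "sat" when choosing the word before it; here the right-hand neighbor is part of the lookup key.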

The catch, historically, has been quality. Diffusion language models have been interesting research curiosities that couldn't quite keep up with the left-to-right mainstream on hard tasks. What makes iLLaDA notable is how much that gap has narrowed. It is a mid-sized model with eight billion parameters, trained from scratch entirely as a diffusion model, and across a broad spread of tasks (general knowledge, math, and writing code) it improves substantially over the previous model in its line. More tellingly, its makers report that it now holds its own against a well-regarded conventional model of similar size. We are deliberately not quoting the benchmark numbers here, because a raw score on a test most people have never heard of carries little meaning. What matters is the trend, and the trend is a genuinely non-autoregressive model reaching roughly the same league as the autoregressive ones at this scale.

A couple of details give the result more credibility than a typical demo. The team kept the diffusion approach all the way through, in both the initial large-scale training and the later fine-tuning on instructions, rather than quietly switching back to conventional methods for the polish. They also released the weights and code openly, so others can poke at the claims directly.

Why it matters: for years the assumption has been that serious language ability requires the one-word-at-a-time recipe. iLLaDA is one more data point that this is an engineering habit, not a law of nature. If diffusion models can match conventional ones at small scale and then scale up while keeping their parallel-generation and editing advantages, that would be a real shift in how language models are built and served.

The honest caveat: "competitive with a strong conventional model" is the authors' framing, and the comparison depends heavily on which model and which tasks. Diffusion language models have also tended to trade away some efficiency to get their parallelism, so the open question is whether iLLaDA's wins survive at the size of a true frontier model and under the cost pressures of real-world serving. An 8-billion-parameter result is a strong signal. A frontier-scale diffusion model that beats the best autoregressive ones would be the actual event. For now, the door that many assumed was closed is visibly open.

Primary source, verified: the paper on arXiv (2606.25331)