
What are diffusion language models?

Language models — the AI systems behind chatbots and writing assistants — almost universally work the same way: they produce one word at a time, left to right, and once a word is written, it stays. Each word is chosen based on all the previous words, and the model never revisits an earlier decision. This approach, called autoregressive generation, is fast, reliable, and well-understood. But it has a structural limitation: if the model writes a wrong assumption in the middle of a reasoning chain, every word that follows gets built on that mistake.
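The left-to-right loop described above can be sketched in a few lines. This is a toy illustration, not a real model: the TOY_NEXT table and the generate_autoregressive function are hypothetical stand-ins, and a real model would score the whole vocabulary given the full prefix rather than look up the last token in a table.

```python
import random

# Hypothetical lookup table standing in for a trained model.
TOY_NEXT = {
    "<s>": ["the"],
    "the": ["cat", "dog"],
    "cat": ["sat"],
    "dog": ["sat"],
    "sat": ["</s>"],
}

def generate_autoregressive(max_len=10, seed=0):
    """Append one token at a time; earlier tokens are never changed."""
    rng = random.Random(seed)
    tokens = ["<s>"]
    while len(tokens) < max_len:
        # A real model conditions on the whole prefix; this toy only uses the last token.
        next_token = rng.choice(TOY_NEXT[tokens[-1]])
        if next_token == "</s>":
            break
        tokens.append(next_token)  # committed: nothing later can revise it
    return tokens[1:]  # drop the start marker

print(generate_autoregressive())
```

The only operation on the output is append, which is exactly why an early mistake cannot be repaired later in the sequence.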

Diffusion language models are a different approach, inspired by a technique that first proved powerful in image generation. The word "diffusion" comes from physics — imagine ink dropped into water, slowly spreading until it's uniformly distributed. In image generation (the technology behind systems like Stable Diffusion), the process works in reverse: start with an image of pure random noise, and repeatedly remove a little noise until a coherent picture emerges. Each step makes the image a bit clearer; after enough steps, you have a recognizable image.

In a diffusion language model, the same idea applies to text. Instead of starting blank and writing left to right, the model starts with a corrupted version of the output and iteratively cleans it up. The exact form of corruption varies, giving rise to two main families that work quite differently.

Masked diffusion replaces some tokens (words or word-pieces) with a special [MASK] placeholder and trains the model to predict what should fill each blank given the rest of the sentence. This is conceptually similar to fill-in-the-blank — but extended to generation: during inference, the model starts with everything masked and iteratively unmasks positions in an adaptively chosen order, filling in the slots it's most confident about first. Crucially, once a slot is filled, it stays filled. The Large Language Diffusion Models (LLaDA) paper established a strong open baseline for this approach at scale.
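A minimal sketch of the confidence-ordered unmasking loop described above, under simplifying assumptions. The toy_predict function is a hypothetical stand-in for a trained denoiser with hard-coded predictions and confidences, and per_step is an illustrative parameter; none of this is taken from the LLaDA implementation.

```python
MASK = "[MASK]"

def toy_predict(seq):
    """Hypothetical denoiser: returns {position: (token, confidence)} for masked slots.

    Hard-coded for a 4-token sequence; a real model would compute these.
    """
    target = ["the", "cat", "sat", "down"]
    confidence = [0.9, 0.4, 0.7, 0.6]
    return {i: (target[i], confidence[i]) for i, tok in enumerate(seq) if tok == MASK}

def masked_diffusion_sample(length, predict, per_step=1):
    """Start fully masked and fill the most confident slots first."""
    seq = [MASK] * length
    history = [list(seq)]
    while MASK in seq:
        preds = predict(seq)
        # Rank the masked positions by confidence and commit the top few.
        best = sorted(preds, key=lambda i: preds[i][1], reverse=True)[:per_step]
        for i in best:
            seq[i] = preds[i][0]  # once filled, a slot is never revisited
        history.append(list(seq))
    return seq, history

final, steps = masked_diffusion_sample(4, toy_predict)
for step in steps:
    print(" ".join(step))
```

In this toy run the positions are filled in the order 0, 2, 3, 1, which is not left to right. The generation order follows confidence, not position.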

Uniform diffusion is more general. Instead of replacing tokens with blank markers, the forward process replaces tokens, with a probability that grows as more noise is added, with randomly chosen real tokens from the vocabulary. Corruption is a random walk through actual words rather than a transition to a special placeholder. This means the reverse process (generation) can change any word at any step, including words it "decided" on two steps ago. No word is final until the generation process ends. Sumi (2026) from Tohoku University is the first large-scale from-scratch model of this type, providing an open reference point for studying the approach.
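The two corruption styles can be put side by side in a short sketch. The vocabulary, the function names, and the single noise_level parameter are all illustrative assumptions; real models use a noise schedule over many steps and vocabularies of tens of thousands of tokens.

```python
import random

# Tiny hypothetical vocabulary for illustration.
VOCAB = ["the", "cat", "dog", "sat", "ran", "down", "up"]

def corrupt_uniform(tokens, noise_level, rng):
    """With probability noise_level, swap a token for one drawn uniformly from VOCAB.

    The draw can land on the original token, so some "corrupted" slots look untouched.
    """
    return [rng.choice(VOCAB) if rng.random() < noise_level else tok for tok in tokens]

def corrupt_masked(tokens, noise_level, rng):
    """With probability noise_level, swap a token for the [MASK] placeholder."""
    return ["[MASK]" if rng.random() < noise_level else tok for tok in tokens]

sentence = ["the", "cat", "sat", "down"]
print(corrupt_masked(sentence, 0.5, random.Random(0)))
print(corrupt_uniform(sentence, 0.5, random.Random(0)))
```

With masking, the model can see exactly which slots are corrupted. With uniform noise, a corrupted token looks like any other word, so the model has to judge every position, which is what makes every position revisable.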

The key structural difference from standard language models is that diffusion LMs predict all positions in parallel at each step, rather than sequentially one at a time. This makes them naturally bidirectional: the model sees the full (partially noisy) sequence when deciding how to denoise each position, not just the tokens that came before it. As a result, later parts of the output can shape earlier parts during generation, which a left-to-right model cannot do.
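One way to picture this difference is through the attention mask, the table that says which positions each position is allowed to look at. The sketch below assumes a transformer-style model, which the text above does not specify; the function names are illustrative.

```python
def causal_mask(n):
    """Left-to-right model: position i may look only at positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Diffusion-style denoiser: every position may look at every position."""
    return [[True] * n for _ in range(n)]

def show(mask):
    # Print 'x' where attention is allowed and '.' where it is blocked.
    for row in mask:
        print(" ".join("x" if allowed else "." for allowed in row))

show(causal_mask(4))
print()
show(bidirectional_mask(4))
```

The causal mask prints as a lower triangle, while the bidirectional mask is fully filled in.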

Why does revisability matter? In principle, a model that can revise its intermediate reasoning — detecting an early error and correcting it before it propagates — could produce more reliable outputs than one locked into its first choices. This is analogous to the difference between writing a first draft and editing it versus committing every sentence permanently as you type it. The possibility of self-correction has driven significant research interest in diffusion LMs.

In practice, however, whether this revisability is actually useful is an open and unsettled question. Sumi's research found a sharp negative result: despite having the mechanical ability to revise any word at any step, the model didn't do anything useful with that ability. Revisions were mostly round-trips — changing a word from A to B and then back to A — with no net improvement in the answer. The revisability exists structurally but is not being exploited.
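To make "round-trip" concrete, here is one hypothetical way to count such revisions from a record of the sequence at each denoising step. This is an illustrative metric only, not the measurement used in the Sumi work; the function name and the trajectory format are assumptions.

```python
def count_round_trips(trajectory):
    """Count A -> B -> A patterns per position across denoising steps.

    trajectory: list of token lists, one per step, all the same length.
    """
    if not trajectory:
        return 0
    round_trips = 0
    for pos in range(len(trajectory[0])):
        # Token history at this position, with consecutive repeats collapsed.
        history = []
        for step in trajectory:
            if not history or history[-1] != step[pos]:
                history.append(step[pos])
        # Each place where the token two changes later equals the current one is a round-trip.
        round_trips += sum(
            1 for k in range(len(history) - 2) if history[k] == history[k + 2]
        )
    return round_trips

trajectory = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],  # "cat" revised to "dog"
    ["the", "cat", "sat"],  # and back again: no net change
]
print(count_round_trips(trajectory))
```

A trajectory dominated by round-trips ends where it started, which is the sense in which revision happens without improving the answer.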

This leaves two possibilities: either the right training objective hasn't yet been found to elicit useful revision, or revisability is inherently difficult to learn and may not yield substantial benefits at current scales. If someone finds the training objective that activates useful self-correction, uniform diffusion becomes the most architecturally flexible text AI available. If no one does, masked diffusion is likely to win the open non-autoregressive competition by default, having demonstrated strong capabilities at scale without the additional complexity.

Current diffusion language models at comparable training budgets can match standard autoregressive models on many tasks, but trail the very best autoregressive models at scale. The gap is real and the field is working on it. For anyone interested in how AI generates text, diffusion LMs represent the most serious architectural alternative to the left-to-right paradigm — and whether they close that gap is one of the more interesting open bets in the field.

For related coverage, see news about the Sumi uniform diffusion model and our explainer on RL post-training, another major technique for improving language models after initial training.

Key papers
Large Language Diffusion Models / LLaDA (2025)
Sumi: Open Uniform Diffusion Language Model (2026)