Distillation: how a small AI learns from a big one

If you have followed the news that one AI lab accused another of "copying" its model, or wondered how a model small enough to run on a laptop can feel almost as sharp as a giant one, you have run into distillation. It is one of the most important ideas in modern AI, and once you see it, you notice it everywhere.

The teacher and the student

Start with a problem. The best AI models are enormous, expensive to run, and slow. You would love a smaller model that behaves almost as well but costs a fraction as much to operate. The obvious approach is to train the small model from scratch on the same data the big one learned from. It works, but the small model usually ends up noticeably dumber.

Distillation is a cleverer route. Instead of training the small model on the raw data, you train it to imitate the big model's answers. The large model becomes a teacher; the small model becomes a student that learns by watching the teacher work. This idea was crystallized in a landmark 2015 paper, "Distilling the Knowledge in a Neural Network," by Geoffrey Hinton and colleagues at Google.

Why imitating answers beats studying the textbook

Here is the subtle part, and the reason distillation works so well. When a model answers a question, it doesn't just pick one option; internally it assigns a confidence to every possibility. Ask it whether a photo shows a husky, and it might be ninety percent sure it's a husky, but also slightly suspect a wolf, and barely consider a cat. That full spread of confidences is far richer than the bare correct answer "husky."
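To make that spread of confidences concrete, here is a minimal sketch in Python. It assumes the model produces one raw score per label and turns those scores into confidences with the standard softmax function. The three scores are invented for illustration, chosen so the husky lands near ninety percent; they do not come from any real model.

```python
import math

def softmax(scores):
    """Turn raw scores into confidences that are positive and sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["husky", "wolf", "cat"]
scores = [5.0, 2.7, 0.0]  # made-up raw scores, for illustration only

confidences = dict(zip(labels, softmax(scores)))

for label, confidence in confidences.items():
    print(f"{label}: {confidence:.1%}")
```

With these invented scores the husky comes out near ninety percent, the wolf around nine percent, and the cat under one percent. A textbook label would keep only the word "husky" and throw the other two numbers away.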

Hinton's team called this the "dark knowledge" hidden in a model's output. The fact that the teacher thinks a husky looks a little like a wolf but nothing like a cat teaches the student something about how the world is shaped, information that the one-word right answer in a textbook never contains. Learning from a knowledgeable teacher's hesitations and near-misses is like being an apprentice who watches a master chef taste a sauce and murmur "almost, needs acid": you absorb the judgment, not just the recipe. That is why a distilled student can reach a quality that training on the raw data alone would not.
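A small sketch can show why the teacher's full spread carries more signal than the bare label. It compares two hypothetical students: one agrees with the teacher that a wolf is the more plausible mistake, the other thinks a cat is. The scores are invented, and the temperature of 2.0 (a knob from the 2015 paper that softens the confidences) is an arbitrary choice made for illustration. This is one simple way to score a student against a teacher, not a full training recipe.

```python
import math

def softmax(scores, temperature=1.0):
    """Confidences from raw scores; a higher temperature flattens the spread."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, predicted):
    """How badly `predicted` matches `target`; lower is better."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

# Label order: husky, wolf, cat. All scores are made up for illustration.
teacher_scores = [5.0, 2.7, 0.0]
student_a_scores = [4.0, 2.0, 0.0]  # second guess is wolf, like the teacher
student_b_scores = [4.0, 0.0, 2.0]  # second guess is cat, unlike the teacher

TEMPERATURE = 2.0                   # assumed value, not a recommendation
hard_label = [1.0, 0.0, 0.0]        # the textbook answer: "husky"
soft_targets = softmax(teacher_scores, TEMPERATURE)

# Graded against the textbook label, the two students look identical.
hard_loss_a = cross_entropy(hard_label, softmax(student_a_scores))
hard_loss_b = cross_entropy(hard_label, softmax(student_b_scores))

# Graded against the teacher's full spread, student A is clearly closer.
soft_loss_a = cross_entropy(soft_targets, softmax(student_a_scores, TEMPERATURE))
soft_loss_b = cross_entropy(soft_targets, softmax(student_b_scores, TEMPERATURE))

print(f"hard-label loss: A={hard_loss_a:.3f}  B={hard_loss_b:.3f}")
print(f"soft-target loss: A={soft_loss_a:.3f}  B={soft_loss_b:.3f}")
```

The hard label cannot tell the two students apart, because both are equally sure about "husky." The teacher's soft targets can, and they penalize the student whose near-misses point the wrong way. Practical recipes usually mix a soft-target loss like this with the ordinary hard-label loss; that combination is left out here to keep the sketch short.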

The most famous early demonstration was DistilBERT in 2019, a language model roughly forty percent smaller and about sixty percent faster than its teacher, BERT, while retaining about 97 percent of its language-understanding performance. Distillation has been a workhorse of efficient AI ever since, and it is a close cousin of training on a model's outputs more generally, which connects to our lesson on synthetic data.

The twist that put distillation in the headlines

The original setup assumes you own the teacher and can peer inside its confidences. But there is a poorer-but-still-powerful version: even if you can only see a model's final text answers, the way anyone using a public AI service can, you can collect a huge pile of its question-and-answer pairs and train your own model to mimic them. You don't get the rich internal confidences, but you get an enormous amount of high-quality demonstration.
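In code, this text-only version amounts to building an ordinary list of prompt-and-response records. The sketch below assumes a hypothetical `query_teacher` function standing in for whatever teacher model one has access to; here it simply returns canned text so the example runs on its own. The record fields and the JSON Lines layout are illustrative choices, not a required format.

```python
import json

def query_teacher(prompt):
    """Hypothetical stand-in for a teacher model: returns canned text only."""
    canned_answers = {
        "What is distillation?": "Training a small model to imitate a larger one.",
        "Is a husky a wolf?": "No, but the two look alike.",
    }
    return canned_answers.get(prompt, "I am not sure.")

def build_imitation_dataset(prompts, ask):
    """Pair each prompt with the teacher's text answer, one record per prompt."""
    return [{"prompt": p, "response": ask(p)} for p in prompts]

prompts = ["What is distillation?", "Is a husky a wolf?"]
dataset = build_imitation_dataset(prompts, query_teacher)

# One JSON object per line: a common layout for text training data.
jsonl = "\n".join(json.dumps(record) for record in dataset)
print(jsonl)
```

Each record holds only the final text, with none of the teacher's internal confidences. A student trained on such records sees what the teacher said, but never how sure it was.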

This is exactly the maneuver at the center of 2026's biggest AI-geopolitics story. When one lab accuses a rival of running a massive campaign to harvest millions of exchanges with its model through fake accounts, the alleged crime is distillation: not stealing the model's code or its internal weights, which would be outright theft, but training a competitor on its outputs. That legal and ethical grayness (it copies the behavior without copying the property) is precisely what makes it so contentious, and it feeds directly into the debate we cover in "Are closed AI models overpriced luxury goods?" It is also why the gap between expensive closed and cheaper open-weight models is so fraught: distillation is one way the cheap models can ride on the expensive ones' coattails.

What to take away

Distillation is a single idea wearing two faces. Used openly, it is how we get fast, affordable models that put capable AI on phones and laptops, an unambiguous good. Used to copy a competitor you don't own, it becomes an accusation of theft and a lever in trade politics. The mechanism is the same in both: a student model learning to imitate a teacher. The only thing that changes is whether you were invited to be the student.

Key papers
Distilling the Knowledge in a Neural Network (Hinton, Vinyals, Dean, 2015)
DistilBERT, a distilled version of BERT (Sanh et al., 2019)