Ground Truth.
AI, checked against the source.

Learn · Intermediate

Synthetic Data: When AI Makes Its Own Training Material

There is a quiet crisis behind the AI boom: we are running low on the thing that made it possible. Large models learned to write, reason, and code by reading a staggering amount of human text -- a large share of the public internet. But that supply is finite, much of it is low quality, and the best of it has largely been used. So the field has turned to a striking alternative: having AI generate or reshape the data that trains the next AI. This is called synthetic data, and it has gone from a curiosity to a central ingredient in nearly every frontier system. Three new pieces of research this week -- on agents that simulate their own practice worlds, a model that tailors raw streams into training material, and an open recipe for curating agent data -- are all variations on this one idea. It is worth understanding on its own.

What 'synthetic data' actually means

The phrase covers a spectrum. At one end is fully generated data: you ask a capable model to write thousands of new examples -- questions and answers, worked problems, code with explanations -- and train a model on them. At the other end is reshaped data: you take real, messy material and have a model clean it, label it, summarize it, or restructure it into something easier to learn from. Both are 'synthetic' in the sense that a machine, not a human, did the work of turning raw material into a lesson.

The simplest analogy is a study guide. Imagine a brilliant student who has read an entire messy library and then writes clean, well-organized practice problems for a younger student. The younger student might learn faster from those tailored problems than from the original chaotic library -- as long as the older student actually understood the material and didn't introduce errors. That is the promise and the peril of synthetic data in one image.

How it became essential

Three ideas built the foundation. Self-Instruct showed in 2022 that a model could generate its own instruction-and-response examples from a small seed set of human-written tasks, then train on them to become markedly better at following instructions -- bootstrapping a skill with very little human input. That same year, STaR showed a model could improve its reasoning by generating step-by-step solutions, keeping the ones that reached the right answer, and training on those -- learning to reason by practicing reasoning. Then, in 2023, Textbooks Are All You Need made the most provocative claim: on coding benchmarks, a relatively small code model trained on a modest amount of carefully filtered and synthesized, textbook-quality data could rival much larger models trained on far more raw web text. The lesson across all three: quality and structure of data can matter as much as sheer quantity -- a direct complement to the scaling laws that say quantity matters too.
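
The generate-then-filter step at the core of STaR can be sketched in a few lines. This is a toy illustration, not the paper's code: toy_generate is a hypothetical stand-in for a language model (it "solves" small addition problems and is deliberately wrong about 30% of the time), and filter_correct is an assumed helper name. The point is only the shape of the loop: generate several attempts, keep the ones whose final answer matches a known answer, and train on what survives.

```python
import random

def toy_generate(question, rng):
    """Hypothetical stand-in for a model: returns (rationale, answer).

    It adds two numbers but injects an error about 30% of the time,
    mimicking a model that sometimes produces confident mistakes.
    """
    a, b = question
    answer = a + b
    if rng.random() < 0.3:
        answer += rng.choice([-2, -1, 1, 2])  # a wrong final answer
    rationale = f"{a} plus {b} equals {answer}"
    return rationale, answer

def filter_correct(problems, attempts_per_problem=4, seed=0):
    """Keep only generated solutions whose final answer matches the key."""
    rng = random.Random(seed)
    kept = []
    for question, gold in problems:
        for _ in range(attempts_per_problem):
            rationale, answer = toy_generate(question, rng)
            if answer == gold:  # the check against a real answer
                kept.append(
                    {"question": question, "rationale": rationale, "answer": answer}
                )
    return kept

# 25 addition problems with known answers (the "ground truth").
problems = [((a, b), a + b) for a in range(1, 6) for b in range(1, 6)]
training_set = filter_correct(problems)
print(f"kept {len(training_set)} of {len(problems) * 4} attempts")
```

Every example that reaches training_set has been checked against a known answer, so the model's mistakes are discarded rather than learned. The filter is only as good as the answer key: it cannot catch a rationale that reaches the right answer for the wrong reason.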

This is the heart of what people now call data-centric AI: the realization that improving the data is often a better lever than improving the model. The work this week pushes it further by making data preparation itself a learned, automated skill rather than a human chore. When an agent practices in a simulated world it built, the experience it gathers is synthetic. When a model refines raw video into dense training examples, the output is synthetic. The human is moving out of the inner loop.

Why it matters

Synthetic data does three things that are hard to get otherwise. It supplies more material when human data runs out. It lets you target specific weaknesses -- generate exactly the kind of hard math or rare edge case a model struggles with. And it is a key engine of reinforcement learning post-training, where models improve by generating attempts and learning from the good ones. It is also a big reason capable open-weight models have caught up so fast: a strong open model can generate training data to teach the next one. Push this loop far enough and you arrive at the doorstep of recursive self-improvement -- systems that improve the very material they learn from, and eventually themselves.

The honest danger: model collapse

Synthetic data is not a free lunch, and the failure mode is serious. If a model learns mostly from data generated by models, errors and biases can compound across generations -- a phenomenon researchers call model collapse. Picture a photocopy of a photocopy of a photocopy: each copy looks close to the one before it, but the artifacts accumulate until the image degrades into mush. A model that trains on its own confident mistakes can amplify them, narrow its own diversity, and forget the long tail of rare-but-real cases that only human data contained. The study-guide analogy returns with teeth: if the older student misunderstood a topic, every younger student inherits the misunderstanding, and no one in the chain ever checks against the original source.
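
The loss of the long tail can be reproduced in a toy setting. The sketch below is an illustration under strong simplifying assumptions, not a result from the model-collapse literature: the "model" is just a table of word frequencies, each generation is trained only on a finite sample drawn from the previous generation, and the vocabulary size, sample size, and starting weights (word i has weight 1 / (i + 1)) are arbitrary choices made for the example.

```python
import random
from collections import Counter

def closed_loop_generations(vocab_size=200, sample_size=500, generations=30, seed=0):
    """Return the count of distinct words still alive, starting at generation 0."""
    rng = random.Random(seed)
    # Generation 0: "human" data with a long tail of rare words.
    words = list(range(vocab_size))
    weights = [1.0 / (i + 1) for i in words]
    surviving = [vocab_size]
    for _ in range(generations):
        # Draw a finite training set from the current model...
        sample = rng.choices(words, weights=weights, k=sample_size)
        # ...then "train" the next model by counting what it saw.
        counts = Counter(sample)
        words = list(counts.keys())
        weights = list(counts.values())
        surviving.append(len(words))
    return surviving

history = closed_loop_generations()
print(history[0], "->", history[-1], "distinct words")
```

The count can only fall. Once a rare word fails to appear in one generation's sample, its frequency becomes zero and no later generation can produce it again. Nothing in the loop ever looks back at the original distribution, which is the photocopy problem in miniature.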

This is why the best synthetic-data systems keep a tether to reality -- filtering generated examples against real answers (as STaR does), grounding them in verifiable facts, or mixing synthetic with fresh human data rather than replacing it. The open question raised by this week's automation push is exactly this: when a model both makes its training data and decides what counts as good, who audits what it quietly bakes in? Synthetic data is one of the most powerful tools in modern AI. Used with a reality check, it extends what models can learn. Used as a closed loop with no ground truth, it is a slow way to teach a model its own blind spots.
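
The case for mixing synthetic data with fresh human data, rather than replacing it, can be seen in a toy comparison. The assumptions here are all illustrative: the "model" is a Gaussian fitted to its training set, "human data" is drawn from a fixed standard normal distribution, and real_fraction is a hypothetical knob for the share of each generation's training set that comes from that real source.

```python
import random
import statistics

def final_spread(real_fraction, n=20, generations=200, seed=0):
    """Refit a Gaussian generation after generation; return the last fitted std."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 matches the real distribution
    n_real = round(n * real_fraction)
    for _ in range(generations):
        # Synthetic examples come from the current model...
        synthetic = [rng.gauss(mu, sigma) for _ in range(n - n_real)]
        # ...real examples come from the unchanging real-world source.
        real = [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        data = synthetic + real
        # "Training" is just re-estimating the mean and the spread.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
    return sigma

closed = final_spread(real_fraction=0.0)  # model trains only on itself
mixed = final_spread(real_fraction=0.5)   # half of each batch is real data
print(f"closed loop std: {closed:.6f}, mixed std: {mixed:.6f}")
```

In the closed loop, each refit loses a little spread on average, and over many generations the fitted distribution narrows toward a point. With real data in every batch, the fit is pulled back toward the true spread each generation. A one-dimensional Gaussian is far simpler than a language model, so this shows the mechanism, not its size.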

Key papers
Self-Instruct: Aligning Language Models with Self-Generated Instructions (Wang et al., 2022)
STaR: Bootstrapping Reasoning With Reasoning (Zelikman et al., 2022)
Textbooks Are All You Need (Gunasekar et al., 2023)