Scaling laws — does bigger always mean better?

One of the most consequential discoveries in modern AI is almost boringly simple: if you make a language model bigger, train it on more data, and spend more computing power, it gets predictably better. Not randomly better — better in a smooth, forecastable way you can plot on a graph. This relationship is called a scaling law, and for the better part of a decade it has been the engine driving the field. It's also why the question "is bigger always better?" has become one of the most important debates in AI.

What the laws actually say

The foundational paper Scaling Laws for Neural Language Models (Kaplan et al., 2020) found that a model's loss (its prediction error) falls as a power law in each of three things: the number of parameters (the model's size), the amount of training data, and the compute spent on training. A power law is a straight line on a log-log plot, so the improvement is predictable enough that you can estimate how good a model will be before you build it. That turned AI development from guesswork into something closer to engineering: you could plan a bigger model and forecast the payoff.
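To make "forecast before you build" concrete, here is a minimal sketch, assuming loss follows a simple power law in model size, loss = a * size^(-b). The measurements are synthetic, and the function names and the constants 10.0 and 0.07 are illustrative choices, not values taken from any paper. The idea is to fit a line in log-log space on small models and extrapolate to a much larger one.

```python
import math

def fit_power_law(sizes, losses):
    """Fit loss = a * size**(-b) by least squares on log-log values."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # equals -b for a power law
    intercept = mean_y - slope * mean_x    # equals log(a)
    return math.exp(intercept), -slope

def predict_loss(a, b, size):
    """Loss predicted by the fitted power law at a given model size."""
    return a * size ** (-b)

# Hypothetical "small model" results, generated from a known power law
# so the fit can be checked. Real measurements would be noisy.
small_sizes = [1e6, 1e7, 1e8, 1e9]
small_losses = [10.0 * n ** -0.07 for n in small_sizes]

a, b = fit_power_law(small_sizes, small_losses)
forecast = predict_loss(a, b, 1e11)  # extrapolate 100x past the largest run
print(f"fitted a={a:.3f}, b={b:.3f}, forecast loss at 1e11 params={forecast:.3f}")
```

In practice the fit would be made on noisy training runs, and an extrapolation this far carries real uncertainty; the sketch only shows the mechanics.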

But "bigger" alone was the wrong lesson. The influential Chinchilla result (Hoffmann et al., 2022) showed that the field had been building models that were too big for the amount of data they were trained on. For a given compute budget, you get a better model by balancing size and data: a smaller model trained on more text can beat a larger model trained on less. Chinchilla itself, with 70 billion parameters, outperformed the 280-billion-parameter Gopher using the same training compute. That reframed the goal from "make it huge" to "make it compute-optimal," and it's the intellectual root of today's smaller, sharper open models.
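The size-versus-data balance can be sketched with two widely used rules of thumb, both of which are assumptions here rather than exact results: training compute is roughly 6 * parameters * tokens (in FLOPs), and a compute-optimal model sees roughly 20 training tokens per parameter. The function name and default values are hypothetical.

```python
import math

def compute_optimal_split(compute_flops, tokens_per_param=20.0, flops_per_param_token=6.0):
    """Split a compute budget into (parameters, tokens).

    Assumes compute = flops_per_param_token * params * tokens and
    tokens = tokens_per_param * params. Both are rules of thumb.
    """
    params = math.sqrt(compute_flops / (flops_per_param_token * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens

# An illustrative budget of 5.76e23 FLOPs.
params, tokens = compute_optimal_split(5.76e23)
print(f"params ~ {params:.2e}, tokens ~ {tokens:.2e}")
```

Under these assumptions the budget works out to roughly 7e10 parameters and 1.4e12 tokens. Doubling the budget grows both numbers by a factor of about 1.4, instead of putting everything into model size.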

The surprise: emergence

Scaling has a strange wrinkle. The paper Emergent Abilities of Large Language Models (Wei et al., 2022) documented capabilities that are essentially absent in smaller models and then appear, sometimes abruptly, once a model crosses a certain scale. Examples include multi-step arithmetic and following intricate instructions. (Researchers still debate how much of this "sudden" appearance is real versus an artifact of how we measure it.) Either way, the practical lesson stuck: scaling doesn't just make existing skills sharper, it can unlock skills that weren't there at all.
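The measurement side of that debate can be illustrated with a toy calculation. Assume, purely for illustration, that a model's per-digit accuracy improves in even steps as it scales, and that digits are independent. An all-or-nothing score that requires every digit of a 10-digit answer to be correct then looks flat and suddenly jumps, even though the underlying skill grew smoothly. All numbers and names below are hypothetical.

```python
def exact_match_rate(per_digit_accuracy, num_digits):
    """Chance of getting every digit right, assuming independent digits."""
    return per_digit_accuracy ** num_digits

# Hypothetical per-digit accuracies for six models of increasing scale.
per_digit = [0.50, 0.60, 0.70, 0.80, 0.90, 0.99]
exact = [exact_match_rate(p, 10) for p in per_digit]

for p, e in zip(per_digit, exact):
    print(f"per-digit {p:.2f} -> all 10 digits correct {e:.4f}")
```

The smooth metric rises steadily while the strict metric stays near zero for the first few models and then climbs quickly. This does not settle whether emergence is real; it only shows how the choice of metric can shape the curve.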

An analogy

Scaling laws are like the relationship between studying and test scores. More hours generally means a better grade, in a fairly predictable curve — that's the law. But two things complicate it. First, how you study matters as much as how long: cramming the wrong material (too big a model, too little data) wastes the effort, which is the Chinchilla lesson. Second, some abilities only click after enough practice — you can't half-learn to ride a bike; one day it just works. That's emergence.

The limits — and the pushback

The scaling story has carried AI a long way, but it bends. Each new gain costs dramatically more compute than the last, and high-quality training data is finite. So the frontier is increasingly about getting more from less rather than simply going bigger. You can see this turn across current research. One example is a world model that thinks in loops: it gains capability not from more parameters but from re-running a small block. Another is the loud debate around a capable open model, which centers on the claim that brute size is no longer the path forward and that efficiency and grounding now matter more than raw scale. Whether or not that specific claim holds, it marks a real shift in mood. The same turn connects to open-weight models: the compute-optimal insight is a large part of what makes small, runnable open models competitive.
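The rising cost of each gain can be put in numbers with a small sketch, assuming loss falls as a power law in compute, loss proportional to compute^(-b). The exponent 0.05 is an illustrative value chosen for this example, not a measured one, and the function name is hypothetical.

```python
def compute_multiplier(loss_reduction, exponent):
    """Factor by which compute must grow to cut loss by `loss_reduction`.

    Assumes loss is proportional to compute**(-exponent), so
    new_loss / old_loss = multiplier**(-exponent).
    """
    return (1.0 - loss_reduction) ** (-1.0 / exponent)

# Illustrative exponent: how much more compute buys a 10% lower loss?
factor = compute_multiplier(0.10, 0.05)
print(f"10% lower loss needs about {factor:.1f}x the compute")
```

With this exponent, a 10% reduction in loss needs about 8 times the compute, and the next 10% needs another factor of about 8 on top of that. The same relative gain keeps getting more expensive in absolute terms.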

Why it matters

Scaling laws explain the last decade of AI: the relentless growth of models was a rational response to a real, measurable pattern. But understanding the shape of the curve — predictable gains, the size-versus-data balance, the diminishing returns at the top — is what separates hype from sense. The next phase of progress is less about who can build the biggest model and more about who can get the most out of a given budget. Bigger has been better for a long time; it has just stopped being the only thing that matters.

Key papers
Scaling Laws for Neural Language Models (Kaplan et al., 2020)
Training Compute-Optimal Large Language Models (Hoffmann et al., 2022)
Emergent Abilities of Large Language Models (Wei et al., 2022)