Learn · Beginner

Mixture of Experts: The Committee Inside a Giant Model

If you have read that a new AI model has 'seven hundred billion parameters' but also that it runs surprisingly cheaply, you have run into a small mystery. Parameters are the model's adjustable knobs, the place its knowledge lives, and more of them usually means a model that is slower and more expensive to run. So how can a model be enormous and quick at once? The answer, in many of the largest models shipping today, is an idea called mixture of experts -- and once you see it, a lot of modern AI starts to make sense.

The core idea: don't wake the whole brain for every word

A traditional neural network runs all of itself for every single thing it does. Every word you feed it touches every parameter. (Strictly speaking, models work on tokens, which are words or pieces of words; this piece says 'word' to keep things simple.) That is simple, but it is wasteful: it is like making every employee in a giant company attend every meeting, even the ones about topics they know nothing about. As models grew, that all-hands-for-everything design became the bottleneck. You wanted more knowledge in the model, but more knowledge meant more parameters, and more parameters meant every word got slower and pricier to process.

Mixture of experts breaks that link. Instead of one big dense network, the model contains many smaller sub-networks called experts -- think of them as specialists. In front of them sits a small, fast traffic cop called a router. For each word, the router looks at what is coming through and picks just a few experts to handle it, while the rest stay asleep. The model might hold dozens or hundreds of experts in total, but only a small handful actually fire for any given word.
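To make the routing step concrete, here is a minimal sketch in plain Python. It assumes a common scheme in which the router gives every expert a score, keeps the top few, and blends their outputs by weight. The function names, the toy experts, and the scores are all invented for illustration; in a real model the scores come from a small learned layer, and the experts are neural networks rather than one-line functions.

```python
import math

def softmax(scores):
    """Turn raw router scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(router_scores, k=2):
    """Pick the k highest-scoring experts and renormalize their weights.

    router_scores: one raw score per expert (made-up values in this sketch).
    Returns a list of (expert_index, weight) pairs.
    """
    weights = softmax(router_scores)
    top = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:k]
    kept = sum(weights[i] for i in top)
    return [(i, weights[i] / kept) for i in top]

def moe_layer(x, experts, router_scores, k=2):
    """Run only the chosen experts and blend their outputs by weight."""
    return sum(w * experts[i](x) for i, w in route(router_scores, k))

# Eight toy "experts": each one just multiplies its input by a different number.
experts = [lambda x, m=m: m * x for m in range(1, 9)]

# Hypothetical router scores for one word; experts 1 and 4 score highest.
scores = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]

chosen = route(scores, k=2)
output = moe_layer(10.0, experts, scores, k=2)

print("experts that woke up:", [i for i, _ in chosen])
print("blended output:", output)
```

Only two of the eight experts are ever called for this word; the other six do no work at all, which is where the savings come from.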

The payoff is the whole point. The model's total size -- its total knowledge -- can be gigantic, because you can keep adding experts. But the cost of running it stays modest, because you only ever pay to run the few experts the router woke up. This is why you will see two numbers quoted for these models: a huge 'total parameters' figure and a much smaller 'active parameters' figure. The first is how much the model knows; the second is how much of it runs per word. A model like GLM-5.2 might have hundreds of billions of total parameters but only activate a fraction of them at a time. Researchers call this 'conditional computation' -- the computation you do depends on the input.
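The two numbers can be worked out with simple arithmetic. The sketch below uses invented figures that do not describe GLM-5.2 or any other real model, and it assumes a simplified picture in which some parameters are shared by every word (for example, the attention layers) and the rest are split evenly among the experts.

```python
def parameter_counts(shared, per_expert, num_experts, experts_per_word):
    """Return (total, active) parameter counts for a simplified MoE model.

    shared:           parameters every word passes through
    per_expert:       parameters inside one expert
    num_experts:      how many experts the model holds
    experts_per_word: how many experts the router wakes for each word
    """
    total = shared + num_experts * per_expert
    active = shared + experts_per_word * per_expert
    return total, active

# Made-up example: 10B shared parameters, 64 experts of 5B each, 2 awake per word.
total, active = parameter_counts(
    shared=10_000_000_000,
    per_expert=5_000_000_000,
    num_experts=64,
    experts_per_word=2,
)

print(f"total parameters:  {total / 1e9:.0f}B")    # total parameters:  330B
print(f"active parameters: {active / 1e9:.0f}B")   # active parameters: 20B
print(f"share running per word: {active / total:.1%}")  # share running per word: 6.1%
```

In this toy case, adding more experts would raise the total without touching the active count, which is the decoupling the paragraph above describes.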

A newsroom analogy

Imagine a magazine with a huge pool of specialist writers -- a science writer, a sports writer, a food critic, a finance reporter, and a hundred more. A traditional dense model is like making the entire pool collaborate on every single article, even a recipe. Slow, and most of them have nothing to add.

A mixture-of-experts model is like having a sharp editor (the router) who reads each assignment and sends it to just the two or three writers who actually know the subject. The magazine still has the combined expertise of all hundred writers -- you can call on any of them when the topic fits -- but any individual article only ever occupies a few of them. You get the depth of a huge staff at the cost of a small one.

Where the idea came from, and where it lives

The modern version of this idea was introduced in 2017 in a paper memorably titled 'Outrageously Large Neural Networks', which showed you could build a layer out of thousands of expert sub-networks and route between them. A few years later, GShard and then Switch Transformers showed the trick could scale to staggering sizes -- hundreds of billions of parameters, and then more than a trillion -- while keeping the per-word cost manageable, and worked out the engineering to spread all those experts across many chips. That lineage is the direct ancestor of today's biggest open and closed models alike.

Until recently, the experts almost always lived in one specific part of the network: the feed-forward layer, the 'thinking' step that processes each word after the attention layer has weighed it against the others. But the idea is general, and it is starting to spread. A 2026 result we covered, where the committee structure moved into the attention layer, is a sign that researchers are finding new places to apply the same logic. We also told the story of one model that is really a committee, if you want to see the idea in a single concrete system.

Why it matters

Mixture of experts is one of the main reasons the scaling story has been able to continue. It is how labs keep making models that know more without making them proportionally slower and costlier to run, and it is a big part of why capable open-weight models you can download have caught up so fast -- the design lets a community-released model carry frontier-scale knowledge while keeping the computation per word modest. Many of the models topping the charts today use it.

The honest caveats

Mixture of experts is not free magic. The router has to learn to send each word to the right experts, and getting that routing to train smoothly is genuinely tricky -- early systems suffered from experts that got overloaded while others sat idle, and a lot of the engineering is about balancing the load. There is also a memory cost that the speed numbers hide: even though only a few experts run per word, all of them have to be kept loaded and ready, so these models are memory-hungry even when they are compute-light. That is part of why a model can be 'cheap to run' in terms of computation and still demand an expensive rack of chips just to hold it. An open license lets anyone download the weights, but not everyone has the hardware to load them. Understanding mixture of experts explains why those two numbers -- total size and active size -- are both worth paying attention to, and why the modern giant models are less like a single mind than a well-managed crowd.
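The load-balancing problem mentioned above is easy to see with a simple count. The sketch below is only an illustration: the `load_report` function and both routing logs are invented, and the imbalance score is a plain ratio chosen for readability, not the balancing loss used in any particular paper.

```python
from collections import Counter

def load_report(assignments, num_experts):
    """Summarize how evenly words were spread across experts.

    assignments: list of expert indices, one per routed word.
    Returns (shares, imbalance, idle), where
      shares    = fraction of words each expert received,
      imbalance = busiest expert's share divided by a perfectly even share
                  (1.0 means perfectly balanced),
      idle      = experts that received no words at all.
    """
    counts = Counter(assignments)
    total = len(assignments)
    shares = [counts.get(i, 0) / total for i in range(num_experts)]
    even_share = 1 / num_experts
    imbalance = max(shares) / even_share
    idle = [i for i, share in enumerate(shares) if share == 0]
    return shares, imbalance, idle

# Hypothetical routing logs for 100 words and 4 experts.
balanced = [0, 1, 2, 3] * 25          # every expert gets 25 words
collapsed = [0] * 85 + [1] * 15       # one expert hogs most of the work

for name, log in [("balanced", balanced), ("collapsed", collapsed)]:
    shares, imbalance, idle = load_report(log, num_experts=4)
    print(name, "shares:", shares, "imbalance:", round(imbalance, 2), "idle:", idle)
```

In the collapsed log, two experts never run, so their parameters take up memory without contributing anything. Training methods for these models typically add some pressure that pushes the router back toward the balanced case.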

Key papers
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (Shazeer et al., 2017)
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding (Lepikhin et al., 2020)
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (Fedus et al., 2021)