Quantization: Shrinking AI Models to Run on Modest Hardware
A large language model is, underneath, a giant pile of numbers - the weights learned during training. A model with billions of parameters has billions of these numbers, and how you store each one decides how much memory the model eats and how fast it runs. Quantization is the art of storing those numbers with less precision so the whole model gets smaller and faster, ideally without getting noticeably dumber. It's the single biggest reason a model that nominally needs a data-center GPU can end up running on a gaming card or even a laptop - and why the open-weight model community obsesses over it.
Precision, in plain terms
Computers store numbers using a fixed number of bits, and more bits mean finer detail. Models are usually trained using 16-bit numbers for each weight - enough precision to capture small distinctions. Quantization asks: do we really need all that detail just to run the model? Often the answer is no. You can squeeze each weight down to 8 bits, 4 bits, or in aggressive cases even fewer, and the model keeps working. The win is direct and large: going from 16 bits to 4 bits cuts the model's memory footprint to roughly a quarter. A model that needed 80 gigabytes of memory might fit in 20. That's the difference between "requires expensive specialized hardware" and "runs on a card you can actually buy."
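The arithmetic behind that 80-to-20 figure can be sketched in a few lines. This is a back-of-the-envelope estimate, not a measurement: the 40-billion-parameter count is a hypothetical value picked so that 16-bit storage comes to 80 gigabytes, and the helper name weight_memory_gb is made up for this example.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Hypothetical model size, chosen so that 16-bit storage comes to 80 GB.
N_PARAMS = 40e9

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(N_PARAMS, bits):.0f} GB")
```

Real quantized formats also store a little bookkeeping (such as scaling factors) next to the weights, and running a model needs memory for more than the weights, so actual numbers land somewhat above this estimate.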
An analogy
Imagine describing the temperature outside. You could say "23.7194 degrees" - very precise, but a mouthful, and mostly wasted detail. Or you could say "about 24 degrees." For deciding what to wear, the rounded version is just as useful and far easier to carry around. Quantization rounds the model's numbers in exactly this spirit: it throws away precision that doesn't change the model's behavior much, keeping the storage cheap. The risk, of course, is rounding too hard - say "warm" instead of "24 degrees" and you've lost something that mattered. The whole craft of quantization is rounding as aggressively as possible while staying on the right side of that line.
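In code, the rounding idea might look like the sketch below. It uses the simplest scheme available (one shared scale, round to nearest), which is an assumption made for illustration and not what any particular library does; the weight values and function names are invented.

```python
def quantize_absmax(values, bits):
    """Symmetric round-to-nearest quantization with one shared scale."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8-bit, 7 for 4-bit, 1 for 2-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # float value of one integer step
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map the integer codes back to approximate float values."""
    return [c * scale for c in codes]


# Made-up weights, purely for illustration.
weights = [0.82, -0.31, 0.05, -0.67, 0.44, -0.12]

for bits in (8, 4, 2):
    codes, scale = quantize_absmax(weights, bits)
    restored = dequantize(codes, scale)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"{bits}-bit codes={codes} worst error={worst:.4f}")
```

The stored model keeps only the small integer codes plus the scale. Fewer bits mean fewer possible codes, so each weight lands farther from its original value: the "about 24 degrees" end of the analogy at 8 bits, and something closer to "warm" at 2.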
Why it usually works (and where it breaks)
Neural networks turn out to be surprisingly tolerant of imprecision - they were trained with noise and redundancy baked in, so small rounding errors in most weights wash out. But the tolerance isn't uniform. A landmark finding, from the LLM.int8() paper, is that a tiny fraction of values - the "outliers," unusually large activations concentrated in a few feature dimensions - carry outsized importance, and crushing those wrecks the model. The fix was to handle the rare big values in 16-bit precision and the common small ones in 8-bit. That insight - not all numbers are equal - runs through the whole field. GPTQ quantizes a layer's weights a few at a time, adjusting the not-yet-quantized weights to compensate for the error introduced as it goes, to hit 4-bit with minimal damage. AWQ protects the weights that matter most by looking at which ones the model's activations actually lean on. These are post-training quantization methods: they shrink an already-trained model without retraining it.
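A toy example shows why a single outlier is so damaging and why setting it aside helps. This sketch illustrates the general idea only and is not the LLM.int8() algorithm itself, which operates on matrix multiplications; the values, the threshold of 1.0, and the function names are all assumptions made for the example.

```python
def quantize_roundtrip(values, bits):
    """Quantize with one shared absmax scale, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [round(v / scale) * scale for v in values]


def mixed_precision_roundtrip(values, bits, threshold):
    """Leave values with |v| >= threshold untouched; quantize only the rest."""
    small = [v for v in values if abs(v) < threshold]
    restored_small = iter(quantize_roundtrip(small, bits))
    return [v if abs(v) >= threshold else next(restored_small) for v in values]


def mean_abs_error(original, restored):
    return sum(abs(o - r) for o, r in zip(original, restored)) / len(original)


# Hypothetical values: mostly small, plus one large outlier.
values = [0.02, -0.05, 0.11, -0.08, 0.04, 0.09, -0.03, 6.0]

naive = quantize_roundtrip(values, bits=8)
mixed = mixed_precision_roundtrip(values, bits=8, threshold=1.0)  # assumed threshold

print(f"one scale for everything:       {mean_abs_error(values, naive):.5f}")
print(f"outlier kept in full precision: {mean_abs_error(values, mixed):.5f}")
```

With one shared scale, the outlier stretches the range so far that the small values are left with only a handful of usable codes. Pulling the outlier out lets the scale fit the small values, so they are represented much more finely.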
Quantizing for training, not just running
Quantization isn't only for inference. QLoRA showed you can keep a big model frozen in 4-bit form and train a small set of add-on weights (low-rank adapters, the "LoRA" in the name) on top of it, making it possible to fine-tune a model far larger than your hardware would otherwise allow. This is part of why customizing capable models got cheap enough for hobbyists and small teams - it pairs naturally with the lightweight fine-tuning ideas covered in reinforcement learning post-training.
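The shape of that setup, a frozen quantized base plus small trainable add-on matrices, can be sketched without any ML library. This is a simplified illustration under stated assumptions: it swaps in plain per-row 4-bit rounding where QLoRA uses its own 4-bit data type, it shows only the forward pass (no training loop), and the sizes and names are toy choices.

```python
import random


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def quantize_matrix(matrix, bits=4):
    """Per-row absmax quantization: integer codes plus one scale per row."""
    qmax = 2 ** (bits - 1) - 1
    codes, scales = [], []
    for row in matrix:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        codes.append([round(w / scale) for w in row])
        scales.append(scale)
    return codes, scales


def adapted_forward(codes, scales, lora_a, lora_b, x):
    """y = dequantize(W_frozen) @ x + B @ (A @ x); only A and B would be trained."""
    base = [s * sum(c * v for c, v in zip(row, x)) for row, s in zip(codes, scales)]
    update = matvec(lora_b, matvec(lora_a, x))
    return [b + u for b, u in zip(base, update)]


random.seed(0)
d_out, d_in, rank = 8, 8, 2  # toy sizes, chosen for illustration

w = [[random.gauss(0, 0.1) for _ in range(d_in)] for _ in range(d_out)]
codes, scales = quantize_matrix(w, bits=4)  # frozen base, stored as 4-bit codes

lora_a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
lora_b = [[0.0] * rank for _ in range(d_out)]  # zero init: the adapter starts as a no-op

x = [random.gauss(0, 1) for _ in range(d_in)]
y = adapted_forward(codes, scales, lora_a, lora_b, x)

frozen = d_out * d_in
trainable = rank * d_in + d_out * rank
print(f"frozen weights: {frozen}, trainable adapter weights: {trainable}")
```

At these toy sizes the adapter is not small relative to the base, but the ratio improves quickly with scale: for a 4096-by-4096 layer with rank 16, the two adapter matrices would hold about 0.8% as many weights as the frozen layer.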
Why it matters
Quantization is one of the great democratizers of AI. It decouples "what model can I use" from "can I afford a rack of data-center GPUs." It's the reason the local-inference community can run capable models at home, the reason phones can host small models offline, and a big lever on the cost of running AI at scale. When this week's discussions argued over whether a 3-bit or 1.5-bit model is genuinely useful or just a benchmark stunt, that's a quantization debate - and the same question hangs over whether agentic systems like Qwen-Image-Agent can be shrunk to run locally without losing the reasoning that's their whole point.
The honest caveats
Quantization is a trade, not a free lunch, and the trade gets steeper the harder you push. Eight-bit is nearly free; four-bit is usually fine with good methods; below that, quality starts to slip in ways that don't always show up on quick benchmarks but appear on hard reasoning, long contexts, or rare knowledge. The very low-bit claims (2-bit, 1.5-bit) are where skepticism is healthiest - a model can look fine on simple prompts and quietly fall apart on the cases you care about. The right amount of quantization depends on the model, the task, and how much quality you can afford to lose. Used sensibly, it's one of the highest-leverage tricks in all of practical AI; pushed recklessly, it's a fast way to make a smart model stupid in hard-to-notice ways. For the broader context of why people want to run these models themselves, see open-weight models.
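One rough way to see the trade steepen is to measure how much rounding error each bit width introduces on the same set of weights. The sketch below uses randomly generated stand-in weights and the simplest rounding scheme, both assumptions made for illustration. Weight error is only a proxy: it says nothing direct about how a real model behaves on hard reasoning or long contexts.

```python
import math
import random


def roundtrip_rmse(weights, bits):
    """Root-mean-square error after quantizing with one absmax scale and restoring."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    squared = sum((w - round(w / scale) * scale) ** 2 for w in weights)
    return math.sqrt(squared / len(weights))


random.seed(0)
weights = [random.gauss(0, 1) for _ in range(10_000)]  # stand-in for real weights
signal = math.sqrt(sum(w * w for w in weights) / len(weights))

for bits in (8, 6, 4, 3, 2):
    relative = roundtrip_rmse(weights, bits) / signal
    print(f"{bits}-bit: relative RMS error = {relative:.3f}")
```

Methods like GPTQ and AWQ exist precisely because this naive scheme degrades so quickly at low bit widths, so a real 4-bit model should fare better than the numbers printed here suggest. The general pattern still holds: each bit removed costs more than the one before it.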
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (Dettmers et al., 2022)
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (Frantar et al., 2022)
QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023)
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration (Lin et al., 2023)