Ground Truth.
AI, checked against the source.

Learn · Intermediate

The KV cache: why AI gets slower and hungrier the longer it talks

When DeepSeek's new V4 models made a million-token context window the default, the achievement wasn't really about reading more - it was about taming something called the KV cache. The KV cache is one of the most important ideas in how AI actually runs, and almost nobody outside the field has heard of it. Here's what it is and why it governs the speed and cost of every chatbot you use.

Start with how a model writes. A language model built on the transformer architecture generates text one token (roughly, one word-piece) at a time. To pick the next token, it uses attention: it looks back over everything written so far and decides which earlier tokens are relevant right now. Mechanically, every previous token gets boiled down into two vectors - a Key (a kind of address that says 'here's what I'm about') and a Value (the actual content it contributes). The model compares the current token's Query against all those Keys to decide where to focus, then pulls in the matching Values.
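To make that look-back concrete, here is a minimal sketch of a single attention step in plain Python. It assumes the standard scaled dot-product form from the Attention Is All You Need paper. The function name `attend` and the toy two-dimensional vectors are made up for illustration; a real model does this with large matrices across many heads and layers.

```python
import math

def attend(query, keys, values):
    """Blend `values` according to how well each Key matches the Query."""
    scale = math.sqrt(len(query))
    # Score every earlier token: dot product of the Query with that token's Key.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Softmax turns the raw scores into weights that sum to 1.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Pull in the Values, weighted by relevance.
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# Two earlier tokens; the Query lines up with the first token's Key.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
out = attend([5.0, 0.0], keys, values)
```

Because the Query points the same way as the first Key, almost all of the weight lands on the first Value, so `out` is dominated by its first component.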

Now the problem. To generate token number 1,000, the model needs the Keys and Values for tokens 1 through 999. To generate token 1,001, it needs tokens 1 through 1,000. If it recomputed all of those from scratch every single time, generating a long passage would be staggeringly wasteful - the redundant work would balloon at least with the square of the length. So it doesn't. It computes each token's Key and Value once, then stashes them in memory and reuses them for every future token. That stash is the KV cache. It's the model's running notebook: write each token's Key and Value down once, glance back at the notebook instead of re-deriving everything.

The analogy: imagine writing a long essay where, before adding each new sentence, you had to re-read and re-summarize the entire essay so far. By the tenth page that's crushing. Instead, you keep a margin of notes - one line per paragraph - and skim the notes. The KV cache is those margin notes, and it's the difference between generation that's merely slow and generation that's impossibly slow.
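The saving is easy to count. The sketch below is a toy tally, not a real decoder: it counts how many Key/Value projections are needed to produce a sequence one token at a time, with and without the notebook. The function `count_projections` is hypothetical, and it counts projections in a single layer only; in a real model the no-cache case is even worse, because deeper layers would also have to redo their attention.

```python
def count_projections(num_tokens, use_cache):
    """Count Key/Value projections needed to process tokens one at a time."""
    cache = []        # stands in for the stored (Key, Value) pairs
    projections = 0
    for position in range(num_tokens):
        if use_cache:
            # Project only the newest token and add it to the notebook.
            cache.append(position)
            projections += 1
        else:
            # Throw the notebook away and re-project every token seen so far.
            cache = list(range(position + 1))
            projections += len(cache)
    return projections

without_cache = count_projections(1000, use_cache=False)  # 500500
with_cache = count_projections(1000, use_cache=True)      # 1000
```

For 1,000 tokens that is 500,500 projections without the cache against 1,000 with it, and the gap widens as the sequence grows.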

The catch - and this is the whole story behind long-context economics - is that the notebook grows. Every new token adds another Key and another Value to the cache, for every layer of the model (and, in a standard design, for every attention head). At a few thousand tokens it's fine. At a million tokens, the KV cache can swell to many gigabytes of fast memory, often dwarfing the model's own weights. This is why long conversations get slower and pricier as they go, why running long context needs expensive high-memory hardware, and why 'supports a million tokens' has historically meant 'supports it if you can afford the memory.'
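A back-of-the-envelope calculation shows how fast the notebook grows. The sketch below multiplies out the pieces: two vectors (one Key, one Value) per token, per layer, per Key/Value head, each `head_dim` numbers wide. The configuration used here (32 layers, 32 Key/Value heads, 128 dimensions per head, 16-bit numbers) is an assumption chosen for round numbers, not the specification of any particular model.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Estimate KV cache size for one sequence, ignoring any framework overhead."""
    # 2 = one Key vector plus one Value vector per token, per layer, per KV head.
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value

# Hypothetical model configuration, for illustration only.
per_token = kv_cache_bytes(1, layers=32, kv_heads=32, head_dim=128)
at_4k = kv_cache_bytes(4096, layers=32, kv_heads=32, head_dim=128)
at_1m = kv_cache_bytes(1_000_000, layers=32, kv_heads=32, head_dim=128)

print(f"per token:  {per_token / 1024:.0f} KiB")
print(f"4,096 tok:  {at_4k / 1024**3:.1f} GiB")
print(f"1,000,000:  {at_1m / 1e9:.0f} GB")
```

Under these assumptions each token costs 512 KiB, a 4,096-token conversation costs 2 GiB, and a million tokens would need roughly 524 GB for a single sequence.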

So a huge amount of engineering goes into shrinking the KV cache. A few of the big levers:

Multi-Query and Grouped-Query Attention. In a standard model, every attention 'head' keeps its own separate Keys and Values, multiplying the cache. The trick (introduced in the Multi-Query and later Grouped-Query Attention papers) is to let many heads share one set of Keys and Values. That can cut the cache several-fold with little quality loss, and it's now standard in most modern models.
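The sharing can be sketched as a simple mapping from query heads to Key/Value heads. The helper names below are hypothetical, and the head counts (32 query heads sharing 8 Key/Value heads) are assumptions for illustration. Setting the number of Key/Value heads to 1 gives the Multi-Query case; setting it equal to the number of query heads gives standard attention.

```python
def kv_head_for(query_head, num_query_heads, num_kv_heads):
    """Return which shared Key/Value head a given query head reads from."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV groups")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size

def cache_shrink_factor(num_query_heads, num_kv_heads):
    """How many times smaller the cache is than with one KV head per query head."""
    return num_query_heads // num_kv_heads

# Grouped-Query: 32 query heads share 8 Key/Value heads, 4 query heads per group.
gqa_mapping = [kv_head_for(h, 32, 8) for h in range(32)]
gqa_factor = cache_shrink_factor(32, 8)   # 4

# Multi-Query: every query head reads the same single Key/Value head.
mqa_mapping = [kv_head_for(h, 32, 1) for h in range(32)]
mqa_factor = cache_shrink_factor(32, 1)   # 32
```

Only the Key/Value heads are cached, so the cache shrinks by the group size while the number of query heads stays the same.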

Sparse attention. Instead of every token attending to all previous tokens, the model attends only to a relevant subset - skimming, not re-reading. This is the lever DeepSeek-V4 pulled (they call it sparse attention plus compression) to make a million-token window affordable enough to leave on by default.
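One generic way to picture 'attend only to a subset' is top-k selection: score the Keys, keep the best few, and run attention over just those. The sketch below illustrates that general idea only. It is not a description of DeepSeek's method, and the function name `sparse_attend` is hypothetical. It also scores every Key in full before choosing, which a practical system would replace with some cheaper selection step.

```python
import math

def sparse_attend(query, keys, values, top_k):
    """Attend only to the `top_k` Keys that best match the Query."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Keep the positions of the highest-scoring Keys; ignore the rest.
    keep = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Softmax over the kept positions only.
    peak = max(scores[i] for i in keep)
    exps = {i: math.exp(scores[i] - peak) for i in keep}
    total = sum(exps.values())
    dim = len(values[0])
    output = [sum(exps[i] / total * values[i][d] for i in keep) for d in range(dim)]
    return output, sorted(keep)

keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
output, kept = sparse_attend([1.0, 0.0], keys, values, top_k=2)
# kept == [0, 2]: only the two best-matching tokens contribute.
```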

Quantization. Store the cached Keys and Values in a lower-precision number format - say 8 or even 4 bits instead of 16 - roughly halving or quartering the memory at a small accuracy cost.
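As a rough illustration, here is a minimal sketch of symmetric 8-bit quantization for one cached vector: store small integers plus a single scale factor, and multiply back when the vector is needed. The function names are hypothetical, and production schemes differ in how they group values and choose scales.

```python
def quantize_int8(vector):
    """Map floats to integers in [-127, 127] plus one shared scale factor."""
    # Fall back to a scale of 1.0 for an all-zero vector to avoid dividing by zero.
    scale = max(abs(x) for x in vector) / 127 or 1.0
    codes = [round(x / scale) for x in vector]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate floats from the stored integers."""
    return [c * scale for c in codes]

cached_key = [0.5, -1.27, 0.02, 1.0]
codes, scale = quantize_int8(cached_key)
restored = dequantize(codes, scale)
worst_error = max(abs(a - b) for a, b in zip(cached_key, restored))
```

Each stored number now takes 8 bits instead of 16, and the worst-case rounding error per value is about half of `scale`.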

The KV cache also explains a feature you've probably benefited from without noticing: prompt caching. When a provider lets you reuse a long fixed prompt cheaply across many requests, what they're often doing is saving the KV cache for that shared prefix so it never has to be recomputed.
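The idea can be sketched as a lookup table keyed by the shared prefix. Everything in the example below is hypothetical: the `PrefixCache` class, its method names, and the string stand-ins for real Key and Value tensors. It is a toy model of the concept, not how any particular provider implements it.

```python
class PrefixCache:
    """Toy store that reuses Key/Value entries for a repeated prompt prefix."""

    def __init__(self):
        self._store = {}
        self.projections = 0   # how many tokens we actually had to process

    def _project(self, token):
        # Stand-in for the expensive Key/Value computation.
        self.projections += 1
        return ("K:" + token, "V:" + token)

    def kv_for(self, prefix_tokens, suffix_tokens):
        key = tuple(prefix_tokens)
        if key not in self._store:
            # First request with this prefix: compute and save its entries.
            self._store[key] = [self._project(t) for t in prefix_tokens]
        # Build a new list so per-request entries never leak into the shared prefix.
        return self._store[key] + [self._project(t) for t in suffix_tokens]

cache = PrefixCache()
system_prompt = ["You", "are", "a", "helpful", "assistant"]
first = cache.kv_for(system_prompt, ["Hello"])
after_first = cache.projections    # 6: five prefix tokens plus one new token
second = cache.kv_for(system_prompt, ["Goodbye"])
after_second = cache.projections   # 7: only the new token was processed
```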

The KV cache even shapes other techniques. Speculative decoding, which speeds up generation by having a small model draft tokens for a big one to check, has to carefully manage the cache so the drafts and the verification stay consistent. And the broader pressure to fit more into the context window is, underneath, a constant fight against KV-cache growth.
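The cache bookkeeping in speculative decoding can be sketched roughly as follows: add entries for the drafted tokens, check them, then roll the cache back to the last token that was accepted. This is a heavily simplified, hypothetical sketch. Tokens stand in for their Key/Value entries, acceptance is a plain equality check rather than the probabilistic test real systems use, and the step where the big model supplies its own replacement token is left out.

```python
def verify_and_trim(cache, draft_tokens, target_tokens):
    """Append drafted entries, then drop any the big model did not accept."""
    committed = len(cache)
    cache.extend(draft_tokens)   # speculative entries go in first
    accepted = 0
    for drafted, wanted in zip(draft_tokens, target_tokens):
        if drafted != wanted:
            break                # first disagreement ends the accepted run
        accepted += 1
    # Roll back so the cache only holds entries for accepted tokens.
    del cache[committed + accepted:]
    return accepted

cache = ["The", "cat"]
accepted = verify_and_trim(cache, ["sat", "on", "a"], ["sat", "on", "the"])
# accepted == 2 and cache == ["The", "cat", "sat", "on"]
```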

The one-sentence takeaway: a model's real working memory during a conversation isn't an abstract 'context' - it's a concrete, growing cache of Keys and Values, and almost every advance in long-context AI is, at heart, a smarter way to keep that cache small.

Key papers
Attention Is All You Need (Vaswani et al., 2017)
Fast Transformer Decoding: One Write-Head is All You Need (Shazeer, 2019)
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (Ainslie et al., 2023)