News · 2026-06-29

DeepSeek's new open models give everyone a million-token memory by default

DeepSeek has previewed its V4 family, and the headline you'll hear is the size: a flagship model with 1.6 trillion total parameters. But the number that actually changes how people work is smaller and stranger. Every official DeepSeek service now defaults to a context window of one million tokens - roughly a few full-length books' worth of text - and the models come with weights you can download and run yourself.

First, the background a newcomer needs. A large language model doesn't have a memory in the human sense. Each time it answers, it re-reads everything in front of it - your question, the conversation so far, any documents you pasted - and that pile of text is called the context. The context window is the hard ceiling on how much it can hold at once. For years that ceiling was a few thousand words, then tens of thousands. Pushing it to a million has been possible but expensive, usually sold as a special, pricey tier. DeepSeek's move is to make a million the everyday default.

The family comes in two sizes. V4-Pro is the big one - 1.6 trillion parameters in total, but only about 49 billion of them (roughly 3 percent) switch on for any given token. That design is called a mixture of experts: instead of running the entire brain for every token, the model routes each token to a small relevant subset of specialists, so it stays affordable to run despite its enormous size. V4-Flash is the smaller, cheaper, faster sibling, meant for everyday chat and quick edits, and DeepSeek says it keeps up with Pro on simpler agent tasks.
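For readers who want to see the routing idea in miniature, here is a toy sketch of top-k expert routing. It is not DeepSeek's router: the four "experts", the router scores, and the choice of k=2 are all made up for illustration. Only the 1.6 trillion and 49 billion figures come from the article.

```python
import math

def top_k_route(scores, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are a softmax over the selected scores only, so they sum to 1.
    """
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in chosen)
    exps = [math.exp(scores[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

def moe_forward(x, experts, scores, k=2):
    """Run only the routed experts on x and blend their outputs."""
    return sum(w * experts[i](x) for i, w in top_k_route(scores, k))

# Hypothetical toy experts: each one just scales its input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
router_scores = [0.1, 2.0, -1.0, 2.0]  # made-up router output for one token

routed = top_k_route(router_scores, k=2)          # experts 1 and 3 are chosen
output = moe_forward(10.0, experts, router_scores, k=2)

# Figures from the article: 1.6 trillion total, about 49 billion active.
active_fraction = 49e9 / 1.6e12
print(f"experts used: {[i for i, _ in routed]}, active share: {active_fraction:.1%}")
```

The point of the sketch is that two of the four experts never run for this token, which is the same reason a 1.6-trillion-parameter model can cost about as much per token as a far smaller one.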

So how do you read a million tokens without the cost exploding? When a model processes a long text, it stores a running set of notes about every previous token - this is the KV cache, and it grows with every token added to the context. At a million tokens those notes become a mountain of memory, and the model normally has to consult every note for every new token it writes. DeepSeek's trick, which it calls sparse attention plus token-wise compression, is to stop doing that. Think of reading a long report: you don't re-read every sentence to write the next one, you skim back to the few parts that matter and keep a compressed gist of the rest. The model does the equivalent - it attends to a sparse, relevant slice of the past and compresses the rest - which is what makes a million-token window cheap enough to leave switched on for everyone.
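To make the "mountain of memory" and the skimming idea concrete, here is a rough sketch in plain Python. The layer count, head count, and head size are hypothetical round numbers, not V4's real configuration. The top-k selection is a generic form of sparse attention, not DeepSeek's published mechanism, and the compression half of the trick is not modeled at all.

```python
import math

def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value):
    """Size of an uncompressed KV cache: one key and one value vector
    per token, per layer, per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values, top_k=None):
    """Scaled dot-product attention for a single query.

    With top_k set, only the top_k best-matching positions contribute;
    every other position is skipped entirely.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    positions = list(range(len(keys)))
    if top_k is not None:
        positions = sorted(positions, key=lambda i: scores[i], reverse=True)[:top_k]
    weights = softmax([scores[i] for i in positions])
    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, positions)) for d in range(dim)]
    return out, sorted(positions)

# Hypothetical model shape: 60 layers, 8 KV heads of size 128, 2-byte values.
cache = kv_cache_bytes(1_000_000, layers=60, kv_heads=8, head_dim=128, bytes_per_value=2)
print(f"uncompressed cache at 1M tokens: {cache / 1e9:.1f} GB")

# Tiny made-up context of four tokens.
keys = [[1, 0], [0, 1], [1, 1], [-1, 0]]
values = [[1.0], [2.0], [3.0], [4.0]]
query = [1, 0]

dense_out, dense_used = attend(query, keys, values)
sparse_out, sparse_used = attend(query, keys, values, top_k=2)
print("dense reads", dense_used, "sparse reads", sparse_used)
```

One caveat about the toy: it still scores every key before picking the top two, so it only saves work in the final blend. A production system would need a cheaper way to decide where to look, and that selection step is where a buried detail can get missed.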

Why it matters: long context is the foundation under a lot of useful work. Feeding an AI an entire codebase, a stack of legal contracts, a year of email threads, or a long research transcript: each of these depends on how much the model can hold at once. Making a million tokens the floor rather than a luxury lowers the bar for everyone building those tools - and because the weights are open, smaller labs and individuals get access to frontier-scale long context without paying a closed provider per token. DeepSeek also says V4-Pro leads open models in math, science, and coding and trails only the very top closed model on general world knowledge, which keeps narrowing the gap between open and closed AI.

The honest caveat is the difference between a model that supports a million tokens and one that uses them well. Long-context models have a well-documented habit of paying close attention to the beginning and end of a huge input while glossing over the middle - the so-called lost-in-the-middle problem - and aggressive sparse attention can make that worse, because skimming is exactly the behavior that risks missing a buried detail. All of DeepSeek's quality claims also come from DeepSeek's own report; nobody outside the company has independently stress-tested the million-token recall yet. So treat the window as a real and welcome capability, but wait for outside long-context retrieval tests before trusting it to never drop the one sentence that mattered on page 400. One practical note for anyone already building on DeepSeek: the older chat and reasoner endpoints retire on July 24, with traffic shifting to V4-Flash, so existing integrations will need a look.
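The kind of outside retrieval test mentioned above is often built as a "needle in a haystack": hide one fact at different depths in a long filler text and check whether the model can still report it. The sketch below assumes a hypothetical `ask_model` function that takes a prompt and returns text. The stand-in `forgetful_model` is not a real model. It reads only the first and last 200 characters, which mimics a reader that skips the middle.

```python
def build_haystack(filler, needle, n_sentences, depth):
    """Return filler text with the needle inserted at a fractional depth
    (0.0 = very start, 1.0 = very end)."""
    sentences = [filler] * n_sentences
    sentences.insert(round(depth * n_sentences), needle)
    return " ".join(sentences)

def recall_by_depth(ask_model, needle, question, expected, depths, n_sentences=1000):
    """Map each depth to True if the model's answer contains the expected fact."""
    results = {}
    for depth in depths:
        haystack = build_haystack("The sky was grey that day.", needle, n_sentences, depth)
        answer = ask_model(haystack + "\n\n" + question)
        results[depth] = expected in answer
    return results

def forgetful_model(prompt):
    """Illustrative stand-in: only sees the start and end of the prompt."""
    visible = prompt[:200] + prompt[-200:]
    return "7421" if "7421" in visible else "I don't know"

results = recall_by_depth(
    forgetful_model,
    needle="The vault code is 7421.",
    question="What is the vault code?",
    expected="7421",
    depths=[0.0, 0.5, 1.0],
)
print(results)
```

Swapping `forgetful_model` for a call to a real long-context model, and stretching the haystack toward the full window, gives a simple recall-by-depth check to run before trusting page 400.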
