
Sparse Attention

Sparse attention is a technique that lets a transformer skip the vast majority of the comparisons it would normally make between tokens. In a standard transformer, every token looks at every other token; sparse attention instead has each token look at only a carefully chosen subset. That single change is what turns a million-token context window from a compute nightmare into something you can actually ship -- and it is the machinery behind a wave of 2026 long-context models.

To see why it matters, you have to understand the cost of ordinary attention. The attention mechanism at the heart of every large language model works by comparing each token to every other token to decide what to pay attention to. If you have 1,000 tokens, that is a million comparisons; 10,000 tokens is 100 million. The cost grows with the square of the sequence length -- double the input and you quadruple the work. This 'quadratic' scaling is the wall that long context runs into. Reading a whole codebase or a long legal document means tens or hundreds of thousands of tokens, and full attention over that is brutally expensive in both compute and memory.
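The arithmetic above is easy to check directly. The sketch below is illustrative only: the function name `dense_comparisons` is made up for this example, and it counts one comparison per ordered pair of tokens, matching the figures in the paragraph.

```python
def dense_comparisons(n_tokens: int) -> int:
    """Token-to-token comparisons made by full attention over n_tokens."""
    return n_tokens * n_tokens


# Print the comparison count for a few sequence lengths.
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7,} tokens -> {dense_comparisons(n):>18,} comparisons")

# Doubling the input quadruples the work.
print(dense_comparisons(2_000) // dense_comparisons(1_000))
```

Real models repeat this work for every attention head in every layer, so the count here is a lower bound on the true cost rather than an estimate of it.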

Sparse attention breaks the wall by exploiting a simple observation: most of those comparisons are nearly worthless. When you are reading, a given word depends heavily on the words right around it, moderately on a few important earlier passages, and almost not at all on the majority of the rest. So why compute a full-strength connection between every pair? Sparse attention builds a fixed or learned pattern of which tokens each token is allowed to attend to, and computes only those. Everything else is simply skipped.
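To make "computes only those" concrete, here is a minimal sketch of attention for a single query token, restricted to a list of allowed positions. It is a toy in pure Python, not how any production kernel is written; the function name `attend` and the `allowed` parameter are assumptions introduced for illustration.

```python
import math


def attend(query, keys, values, allowed):
    """Softmax attention for one query over only the positions in `allowed`.

    Positions outside `allowed` are never scored, which is where the
    savings come from.
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, keys[j])) for j in allowed]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for w, j in zip(weights, allowed):
        for d, v in enumerate(values[j]):
            out[d] += (w / total) * v
    return out


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
query = [1.0, 0.0]

dense = attend(query, keys, values, allowed=[0, 1, 2])   # scores all 3 keys
sparse = attend(query, keys, values, allowed=[0, 2])     # scores only 2 keys
print(dense, sparse)
```

Passing every position as `allowed` recovers ordinary dense attention, so the sparse variant is the same computation over a shorter list.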

The patterns come in a few recognizable flavors. A sliding window (or local) pattern lets each token attend to its neighbors within some fixed distance -- great for capturing local structure cheaply. A global pattern designates a handful of special tokens that everyone attends to and that attend to everyone, giving the model a few 'hubs' through which distant information can flow. A strided or dilated pattern skips at regular intervals to reach far-away tokens without connecting to all of them. Real systems combine these: the Longformer and Big Bird architectures pair local windows with a few global tokens (Big Bird also adds a handful of random connections), which makes their cost grow linearly with sequence length, and the original Sparse Transformers paper from Rewon Child and colleagues at OpenAI introduced strided patterns that bring the cost down from quadratic, O(n^2), to O(n * sqrt(n)).
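The three patterns can be sketched as a function that returns the set of positions a token may attend to. This is a simplified, bidirectional toy, not the exact layout used by Longformer, Big Bird, or Sparse Transformers; the names `allowed_positions` and `count_pairs` and all parameter values are assumptions chosen for illustration.

```python
def allowed_positions(i, n, window, stride, global_tokens):
    """Positions token i may attend to under a combined sparse pattern."""
    if i in global_tokens:
        return set(range(n))  # a global token attends to everyone
    local = set(range(max(0, i - window), min(n, i + window + 1)))  # sliding window
    far = {j for j in range(n) if (i - j) % stride == 0}            # strided links
    return local | far | set(global_tokens)                         # plus the hubs


def count_pairs(n, window, stride, global_tokens):
    """Total attended pairs across all n tokens."""
    return sum(
        len(allowed_positions(i, n, window, stride, global_tokens))
        for i in range(n)
    )


n = 1024
sparse = count_pairs(n, window=8, stride=64, global_tokens={0})
print(f"sparse pairs: {sparse:,} of {n * n:,} dense pairs")
print(sorted(allowed_positions(5, 10, window=1, stride=4, global_tokens={0})))
```

With a fixed window and a fixed number of global tokens, the per-token set stays roughly the same size as `n` grows, which is why the total scales close to linearly instead of quadratically.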

Here is an analogy. Full attention is a conference where every person must have a one-on-one conversation with every other person before any decision is made -- fine for ten people, impossible for ten thousand. Sparse attention is how a real large organization works: you talk to your immediate team constantly (the local window), a few designated coordinators talk to everyone and relay what matters (global tokens), and you occasionally reach across to a distant department (the strided links). Almost all the important information still gets where it needs to go, at a tiny fraction of the conversations.

The cutting edge keeps refining which connections to keep and how to compute them efficiently. In 2026, models chasing usable million-token context lean heavily on sparse attention plus clever engineering. Z.ai's newly announced GLM-5.2, for instance, uses a trick it calls IndexShare that reuses the same attention indexer across groups of sparse-attention layers, cutting the per-token compute cost by a factor of nearly three at full context length -- a reminder that sparse attention is not one fixed algorithm but an active design space.

The honest trade-off: sparsity means the model cannot directly connect every token to every other within a single layer. Information can still travel indirectly, hopping through windows and global tokens over several layers, but a poorly chosen pattern can miss a genuinely important long-range dependency. The art is in designing patterns that keep the connections that matter. In practice, well-designed sparse attention loses very little quality while unlocking context lengths that dense attention could not reach at a practical cost, which is why nearly every long-context model relies on some form of it. It pairs naturally with the KV cache, another key trick for making long-context generation affordable, and it is one of the main reasons the effective size of a model's working memory has grown so fast. Understanding sparse attention is understanding how AI learned to read long things without going broke.
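One way to reason about what a pattern can and cannot connect is to count how many stacked layers it takes before information at one position can influence another. The sketch below does that with a simple breadth-first expansion. It is a toy model of reachability only, says nothing about whether a trained model actually uses the path, and all names and sizes (`N`, `W`, `HUB`, `layers_to_reach`) are hypothetical.

```python
N, W, HUB = 100, 2, 50  # toy sequence length, window radius, global token index


def window_only(i):
    """Token i attends to neighbors within W positions."""
    return set(range(max(0, i - W), min(N, i + W + 1)))


def window_plus_hub(i):
    """Same window, plus one global token that sees and is seen by everyone."""
    if i == HUB:
        return set(range(N))
    return window_only(i) | {HUB}


def layers_to_reach(n, source, target, attends_to):
    """Fewest layers before `source` can influence `target`, or None if never."""
    reached = {source}
    layers = 0
    while target not in reached:
        # A token picks up the information if it attends to any reached token.
        grown = reached | {i for i in range(n) if attends_to(i) & reached}
        if grown == reached:
            return None  # the pattern can never connect these two positions
        reached = grown
        layers += 1
    return layers


print(layers_to_reach(N, 10, 99, window_only))      # 45
print(layers_to_reach(N, 10, 99, window_plus_hub))  # 2
print(layers_to_reach(N, 10, 99, lambda i: {i}))    # None
```

In this toy setting a single global token cuts the path from 45 layers to 2, while a pattern in which each token sees only itself never connects the two positions at all.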

Key papers
Generating Long Sequences with Sparse Transformers (Child et al., 2019)
Longformer: The Long-Document Transformer (Beltagy et al., 2020)
Big Bird: Transformers for Longer Sequences (Zaheer et al., 2020)

Key questions

What problem does sparse attention solve?

It cuts the quadratic cost of standard attention, where doubling the input length quadruples the compute. That quadratic growth is what otherwise makes very long context windows prohibitively expensive.

Does sparse attention hurt quality?

Usually only a little, because most tokens depend most strongly on nearby tokens plus a few important distant ones, so a well-chosen sparse pattern keeps the connections that matter and drops the ones that rarely do.

How is sparse attention different from a smaller context window?

A smaller window throws away distant text entirely, while sparse attention keeps the whole sequence in view but connects each token to only a subset of the others, preserving long-range reach at lower cost.

Topics: attention · transformers · long-context · efficiency · sparse-attention