attention
The KV cache: why AI gets slower and hungrier the longer it talks Lesson
The hidden notebook that lets a model avoid re-reading every previous word, and the single biggest reason long context is expensive.
DeepSeek's new open models give everyone a million-word memory by default News
DeepSeek previewed two free-to-download V4 models that can read a million tokens at once, now as the standard setting rather than a premium add-on.
Transformers: the engine inside almost every modern AI Lesson
The neural-network design behind GPT, Claude, and nearly every modern AI model, and the one idea, attention, that made it work.
A small but elegant idea: putting 'experts' inside the attention layer News
Grouped Query Experts brings the mixture-of-experts trick into attention, activating only half a model's query heads per token while matching the full version, at least at small scale.
A Classic Efficiency Trick Just Moved Into a New Part of the AI News
For years, the committee-of-specialists design that keeps big models fast lived in one layer of the network. A clean new result shows it works in the attention layer too, halving some of the work for free.