efficiency
Fine-tuning and LoRA: teaching an old model a new job without retraining it Lesson
You almost never train an AI from scratch. You take one that already knows language and nudge it toward your specific task - and a trick called LoRA lets you do that by adding a tiny sticky note instead of rewriting the whole brain.
The KV cache: why AI gets slower and hungrier the longer it talks Lesson
The hidden notebook that lets a model avoid re-reading every previous word - and the single biggest reason long context is expensive.
The trick that makes AI type faster just hit the top of Hacker News News
A small model guesses ahead and a big model checks the work in parallel - and this week two efforts pushing that idea, DeepSeek's DSpark and JetSpec, lit up the front page while the community argued over whether it's truly 'lossless.'
Speculative Decoding: How AI Types Faster Without Changing a Word Lesson
A small, fast model guesses the next few words and a big, slow model checks them all in one pass - producing the exact same output, just quicker. The trick behind a lot of modern AI speedups.
Quantization: Shrinking AI Models to Run on Modest Hardware Lesson
Storing a model's numbers with less precision - 8, 4, or even fewer bits instead of 16 - makes it dramatically smaller and faster, often with almost no loss in quality. It's why big models can run on a laptop or a single GPU.
Distillation: how a small AI learns from a big one Lesson
Distillation trains a smaller, cheaper model to imitate a larger, smarter one. It is the idea behind both efficient deployment and the 'copying' accusations now driving AI geopolitics.
Mixture of Experts: The Committee Inside a Giant Model Lesson
Why the biggest AI models are not really one big brain but a large team of specialists, only a few of whom wake up for any given word -- the trick that lets a model be huge and fast at the same time.
A small but elegant idea: putting 'experts' inside the attention layer News
Grouped Query Experts brings the mixture-of-experts trick into attention, activating only half a model's query heads per token while matching the full version -- at least at small scale.
A Classic Efficiency Trick Just Moved Into a New Part of the AI News
For years, the committee-of-specialists design that keeps big models fast lived in one layer of the network. A clean new result shows it works in the attention layer too, halving some of the work with no apparent loss in quality.
Teaching AI with rewards — minus the expensive second model that grades it News
The standard way to polish a model with rewards quietly runs a second 'critic' model alongside it. A new method derives the critic's judgment from the model itself, dropping the extra cost.
Scaling laws — does bigger always mean better? Lesson
For years, AI progress ran on a simple recipe: make the model bigger, feed it more data, get a better model. That pattern is real and predictable — but it has limits and surprises. Here's what scaling laws actually say.
Robots may not need to picture the future as video to act on it News
Generating a full imagined video of what comes next is expensive. A new method skips it — pulling a robot's next move straight from the inner workings of an image-editing model.
A world model that thinks in loops instead of stacking layers News
Instead of building an ever-deeper neural network to simulate the future, a new design re-runs one small block over and over — doing comparable work with a fraction of the size.
Faster AI training by quietly cloning the model News
Teaching a model with rewards is slow because it has to write out endless practice answers. A new trick: make a cheap, shrunk-down copy of the model to crank those out faster.
Do robots even need to imagine the movie? News
The common belief is that a robot needs to imagine a video of what happens next to plan. A new method says no — imagine a single still frame, and don't even fully draw it.
A tiny image-fixer keeps up with a model fifty times its size News
Filling in the missing parts of an image usually takes a huge model. This one is a small fraction of the size and far faster, yet it matches a system roughly fifty times larger.
JetSpec Tool
Parallel tree-drafting speculative decoding aiming for large, lossless inference speedups. Project page and writeup with code; reports generation up to several times faster, depending on the model and workload.
DeepSeek DSpark Tool
Open-source speculative-decoding implementation using parallel tree drafting to speed up text generation with no change to the model's output - the project that topped Hacker News this week. Drop-in inference speedups for self-hosted models.