efficiency

Everything on Ground Truth tagged “efficiency” — 18 items.

Fine-tuning and LoRA: teaching an old model a new job without retraining it Lesson

You almost never train an AI from scratch. You take one that already knows language and nudge it toward your specific task - and a trick called LoRA lets you do that by adding a tiny sticky note instead of rewriting the whole brain.

The KV cache: why AI gets slower and hungrier the longer it talks Lesson

The hidden notebook that lets a model avoid re-reading every previous word - and the single biggest reason long context is expensive.

The trick that makes AI type faster just hit the top of Hacker News News

A small model guesses ahead and a big model checks the work in parallel - and this week two efforts pushing that idea, JetSpec and DeepSeek's DSpark, lit up the front page while the community argued over whether it's truly 'lossless.'

Speculative Decoding: How AI Types Faster Without Changing a Word Lesson

A small, fast model guesses the next few words and a big, slow model checks them all in one pass - producing the exact same output, just quicker. The trick behind a lot of modern AI speedups.

Quantization: Shrinking AI Models to Run on Modest Hardware Lesson

Storing a model's numbers with less precision - 8, 4, or even fewer bits instead of 16 - makes it dramatically smaller and faster, often with almost no loss in quality. It's why big models can run on a laptop or a single GPU.

Distillation: how a small AI learns from a big one Lesson

Distillation trains a smaller, cheaper model to imitate a larger, smarter one. It's the idea behind both efficient deployment and the 'copying' accusations now driving AI geopolitics.

Mixture of Experts: The Committee Inside a Giant Model Lesson

Why the biggest AI models are not really one big brain but a large team of specialists, only a few of whom wake up for any given word -- the trick that lets a model be huge and fast at the same time.

A small but elegant idea: putting 'experts' inside the attention layer News

Grouped Query Experts brings the mixture-of-experts trick into attention, activating only half a model's query heads per token while matching the quality of the full version -- at least at small scale.

A Classic Efficiency Trick Just Moved Into a New Part of the AI News

For years, the committee-of-specialists design that keeps big models fast lived in one layer of the network. A clean new result shows it works in the attention layer too, halving some of the work for free.

Teaching AI with rewards — minus the expensive second model that grades it News

The standard way to polish a model with rewards quietly runs a second 'critic' model alongside it. A new method derives the critic's judgment from the model itself, dropping the extra cost.

Scaling laws — does bigger always mean better? Lesson

For years, AI progress ran on a simple recipe: make the model bigger, feed it more data, get a better model. That pattern is real and predictable — but it has limits and surprises. Here's what scaling laws actually say.

Robots may not need to picture the future as video to act on it News

Generating a full imagined video of what comes next is expensive. A new method skips it — pulling a robot's next move straight from the inner workings of an image-editing model.

A world model that thinks in loops instead of stacking layers News

Instead of building an ever-deeper neural network to simulate the future, a new design re-runs one small block over and over — doing comparable work with a fraction of the size.

Faster AI training by quietly cloning the model News

Teaching a model with rewards is slow because it has to write out endless practice answers. A new trick: make a cheap, shrunk-down copy of the model to crank those out faster.

Do robots even need to imagine the movie? News

The common belief is that a robot needs to imagine a video of what happens next to plan. A new method says no — imagine a single still frame, and don't even fully draw it.

A tiny image-fixer keeps up with a model fifty times its size News

Filling in the missing parts of an image usually takes a huge model. This one is far smaller and faster, yet matches a system fifty times its size.

JetSpec Tool

Parallel tree-drafting speculative decoding aiming for large, lossless inference speedups; project page and writeup with code, with reported speedups of up to several times depending on the model and workload.

DeepSeek DSpark Tool

Open-source speculative-decoding implementation using parallel tree drafting to speed up text generation with no change to the model's output - the project that topped Hacker News this week. Drop-in inference speedups for self-hosted models.