inference

Everything on Ground Truth tagged “inference” — 21 items.

Temperature and top-p: how an AI actually picks its next word Lesson

A language model does not know its next word; it produces a list of odds and then rolls dice. The rules of that dice roll are why the same prompt gives you a boring answer one day and a wild one the next.
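As a rough illustration of that dice roll, here is a minimal sketch in plain Python. The function name `sample_next_token` and the toy scores are made up for this example; real engines work on large tensors of scores, not small dictionaries, and may order or combine these steps differently.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Pick one token from a {token: score} dict using temperature and top-p."""
    if temperature <= 0:
        # Treat zero temperature as "always take the most likely token".
        return max(logits, key=logits.get)

    # Temperature rescales the scores: below 1 sharpens the odds, above 1 flattens them.
    scaled = {tok: score / temperature for tok, score in logits.items()}

    # Softmax turns scores into probabilities (subtract the max for stability).
    peak = max(scaled.values())
    weights = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(weights.values())
    probs = {tok: w / total for tok, w in weights.items()}

    # Top-p keeps the smallest set of likeliest tokens whose odds add up to top_p.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Roll the dice among the surviving tokens, weighted by their odds.
    tokens, token_probs = zip(*kept)
    return rng.choices(tokens, weights=token_probs, k=1)[0]

# Hypothetical scores for the word after "The cat sat on the".
toy_logits = {"mat": 2.0, "sofa": 1.0, "moon": -1.0}
print(sample_next_token(toy_logits, temperature=0.7, top_p=0.9))
```

Lowering `temperature` or `top_p` pushes the result toward the safest word; raising them lets unlikely words through more often.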

Ollama nearly doubles Gemma's speed on Macs by guessing ahead News

A free local-AI tool now runs Google's Gemma model far faster on Apple computers using a trick where a small model drafts words and the big one checks them in bulk.

The KV cache: why AI gets slower and hungrier the longer it talks Lesson

The hidden notebook that lets a model avoid re-reading every previous word - and the single biggest reason long context is expensive.

The trick that makes AI type faster just hit the top of Hacker News News

A small model guesses ahead and a big model checks the work in parallel - and this week two efforts pushing that idea, DeepSeek's DSpark and JetSpec, lit up the front page while the community argued over whether it's truly 'lossless.'

Speculative Decoding: How AI Types Faster Without Changing a Word Lesson

A small, fast model guesses the next few words and a big, slow model checks them all in one pass, producing the same output the big model would have written on its own, just quicker. It is the trick behind a lot of modern AI speedups.
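The draft-then-check loop can be sketched in a few lines of Python. This is a simplified, greedy-only version under stated assumptions: `target_next` and `draft_next` are hypothetical stand-ins for the big and small models, each returning the one token it would pick next, and the "one pass" check is written as an ordinary loop rather than a batched GPU call.

```python
def speculative_decode(target_next, draft_next, prompt, num_tokens, lookahead=4):
    """Greedy speculative decoding with toy next-token functions."""
    out = list(prompt)
    goal = len(prompt) + num_tokens
    while len(out) < goal:
        # 1. The small model drafts a few tokens ahead, one at a time.
        draft = []
        for _ in range(lookahead):
            draft.append(draft_next(out + draft))

        # 2. The big model checks each drafted position. A real engine scores
        #    all positions in one batched forward pass; this is a plain loop.
        accepted = []
        for i, guess in enumerate(draft):
            correct = target_next(out + draft[:i])
            accepted.append(correct)
            if guess != correct:
                break  # first mismatch: keep the big model's token, drop the rest
        out.extend(accepted)
    return out[:goal]

def plain_decode(target_next, prompt, num_tokens):
    """Baseline: the big model alone, one token per step."""
    out = list(prompt)
    for _ in range(num_tokens):
        out.append(target_next(out))
    return out

# Toy "models" over integer tokens: the draft agrees with the target only sometimes.
target = lambda toks: (toks[-1] * 3 + 1) % 7
draft = lambda toks: (toks[-1] * 3 + 1) % 7 if toks[-1] % 2 else 0

print(speculative_decode(target, draft, [1], 10) == plain_decode(target, [1], 10))
```

Every token that ends up in the output is one the big model chose, which is why the result matches the baseline; the speedup comes from how many draft guesses survive each check.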

Quantization: Shrinking AI Models to Run on Modest Hardware Lesson

Storing a model's numbers with less precision (8, 4, or even fewer bits instead of 16) makes it dramatically smaller and faster, often with almost no loss in quality. It's why big models can run on a laptop or a single GPU.
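To make the idea concrete, here is a minimal sketch of one simple scheme: symmetric rounding with a single shared scale factor. The function names and sample weights are invented for illustration, and production formats typically use per-block scales and other refinements not shown here.

```python
def quantize(weights, bits=8):
    """Map floats to signed integers of the given width, sharing one scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    largest = max(abs(w) for w in weights)
    scale = largest / qmax if largest else 1.0
    # Round each weight to the nearest step and clamp it into range.
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.12, -0.5, 0.33, 0.9]  # made-up values
for bits in (8, 4):
    q, scale = quantize(weights, bits)
    restored = dequantize(q, scale)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(bits, "bits:", q, "worst error", worst)
```

Fewer bits means fewer distinct steps, so each number takes less space and the rounding error grows; the practical question is how much error a given model tolerates.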

Frontier AI is getting more expensive while open models keep getting cheaper News

Closed frontier models are raising prices and tightening access just as Chinese open-weight models slash theirs, a structural reversal with big consequences for who builds with AI.

Chain-of-thought: why making an AI think out loud makes it smarter Lesson

Asking a model to work through a problem step by step, instead of blurting an answer, dramatically improves it on hard tasks. Here is why that simple trick works, what it really buys the model, and where it backfires.

Training vs inference: the two very different jobs inside every AI Lesson

Why building an AI model and using it are separate worlds with separate costs, and why that split explains custom chips, model prices, and where the real money in AI actually goes.

OpenAI designs its own chip to run its models News

With Broadcom, OpenAI unveiled a custom chip built for one job: serving its AI models cheaply.

Two labs race to make AI write whole paragraphs at once instead of word by word News

Diffusion text models generate in parallel blocks rather than left to right; Google's open DiffusionGemma and Inception's Mercury 2 are now in a head-to-head over speed.

Suddenly, downloadable AI models look like an insurance policy News

With a top hosted model pulled overnight, a flood of powerful open models you can run yourself, and run fast, is being reframed from hobby to risk management.

vLLM v0.23.0 Tool

The widely used open engine for serving language models fast and cheaply. The latest release adds smarter memory handling for long conversations and faster GPU execution.

llama.cpp Tool

The lean, fast engine that makes big models run on ordinary laptops; powers much of the local-AI ecosystem.

SGLang v0.5.13 Tool

A high-performance open serving engine for language models. The new version turns on faster 'guess-ahead' decoding by default and trims scheduling overhead for quicker responses.

Ollama 0.31 Tool

Run open models on your own computer; the new version nearly doubles Gemma's speed on Apple Silicon using multi-token prediction, on by default.

Modular MAX + Mojo Tool

A programming language (Mojo) and compiler/runtime (MAX) for running AI models efficiently across different hardware instead of being locked to one chip vendor; now being acquired by Qualcomm but still openly available to developers.

JetSpec Tool

Parallel tree-drafting speculative decoding aiming for large, lossless inference speedups. The project page and writeup include code and report generation up to several times faster, depending on the model and workload.

GLM-5.2 on Baseten Tool

The top trending open-weight model served as a fast hosted endpoint, reported at 280+ tokens/sec on Blackwell-class hardware: an open model you can call like a closed one.

Doubleword (async + batch inference) Tool

Run the same models you already use, but on async and batch tiers that trade latency for a large cost cut on workloads that don't need an instant reply: long-running agents, evaluations, and bulk jobs.

DeepSeek DSpark Tool

Open-source speculative-decoding implementation using parallel tree drafting to speed up text generation with no change to the model's output, and the project that topped Hacker News this week. Drop-in inference speedups for self-hosted models.