Learn · Intermediate

Fine-tuning and LoRA: teaching an old model a new job without retraining it

Training a large AI model from nothing is one of the most expensive things humans do with computers - millions of dollars, months of time, oceans of text. Almost nobody does it. Instead, the field runs on a much cheaper idea: take a model that has already been trained on a huge chunk of the internet and learned the general shape of language, then give it a small, focused nudge toward the specific job you care about. That nudge is called fine-tuning, and it is how a general-purpose model becomes a medical-notes summarizer, a customer-service bot that speaks in your brand's voice, or a code assistant tuned to your company's style.

To see why this works, it helps to remember the two-stage life of a model, which we cover in training vs inference. The first stage, pretraining, is the massive, expensive one: the model reads a huge chunk of the internet and learns grammar, facts, reasoning patterns, the works. What comes out is a model with broad competence but no particular focus - a brilliant generalist. Fine-tuning is a second, far smaller training stage layered on top. You show the model a modest set of examples of the exact behavior you want - a few hundred or a few thousand, not billions - and let it adjust so that behavior becomes its default. It is the difference between a medical-school graduate and a trained cardiologist: same foundation, a focused specialization added on top. Crucially, the model keeps most of what it learned in pretraining, although heavy fine-tuning on narrow data can erode some of it; you are steering it, not rebuilding it.

But classic fine-tuning has a brutal cost problem. A large model's knowledge lives in billions of internal numbers called weights, and traditional "full" fine-tuning means adjusting all of them, then saving a complete new copy of the model - hundreds of gigabytes for the largest ones - for every task. Fine-tune it for legal work and again for marketing and you now store two giant models. That is expensive to compute, expensive to store, and out of reach for anyone without a data center. This is the wall that a technique called LoRA - short for low-rank adaptation - tore down, and it is why fine-tuning went from a big-lab luxury to something a hobbyist can do on a single graphics card.
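To put rough numbers on that storage cost, here is a back-of-the-envelope sketch. The model size (175 billion weights) and the 16-bit storage format are assumptions picked for illustration, not figures from this article, and the function names are made up for the example.

```python
# Back-of-the-envelope storage cost of full fine-tuning.
# All figures are illustrative assumptions, not measurements.

BYTES_PER_WEIGHT = 2  # assumes each weight is stored as a 16-bit number


def checkpoint_gb(num_weights, bytes_per_weight=BYTES_PER_WEIGHT):
    """Size of one saved copy of the model, in gigabytes (1 GB = 1e9 bytes)."""
    return num_weights * bytes_per_weight / 1e9


def full_finetune_storage_gb(num_weights, num_tasks):
    """Full fine-tuning keeps one complete copy of the model per task."""
    return num_tasks * checkpoint_gb(num_weights)


hypothetical_model = 175_000_000_000  # 175 billion weights, chosen for illustration
tasks = ["legal", "marketing"]

one_copy = checkpoint_gb(hypothetical_model)
total = full_finetune_storage_gb(hypothetical_model, len(tasks))
print(f"one copy: {one_copy:.0f} GB, {len(tasks)} tasks: {total:.0f} GB")
```

Storage grows in lockstep with the number of tasks, which is why keeping a separate full copy per job stops being practical very quickly.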

The LoRA insight, from the 2021 paper that introduced it, is beautifully lazy. Instead of editing the model's billions of weights, you freeze the entire original model - touch nothing - and bolt on a tiny set of new numbers alongside it. During fine-tuning, only those small add-on numbers learn; the giant frozen model just provides its existing knowledge underneath. The add-on is small because of a mathematical shortcut: the change you need to make to a giant grid of weights can often be closely approximated by two much skinnier grids multiplied together, so you train those two skinny grids instead of the enormous one. The result is an adapter that is often thousands of times smaller than the full model - small enough to email. Think of the base model as an expensive published textbook you are not allowed to write in, and LoRA as a set of margin sticky notes: the book stays pristine, the notes carry your customization, and you can peel off one set of notes and slap on another to switch the model between tasks almost instantly. That last part is a real practical win - you keep one copy of the big model and swap tiny adapters for legal, marketing, or support.
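The "two skinny grids" idea can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the paper's implementation: the sizes are tiny, the helper names (matmul, forward, new_adapter) are invented for the example, no training loop is shown, and the scaling factor that real LoRA applies to the update is left out. Starting B at zero follows the paper's setup, so a fresh adapter changes nothing until it is trained.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Add two same-shaped matrices element by element."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Matrix of small random numbers."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def full_params(d_out, d_in):
    """Numbers trained when the whole grid is edited."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Numbers trained for the two skinny grids B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


def forward(x, W, adapter=None):
    """Compute W @ x, plus B @ (A @ x) when an adapter is attached."""
    y = matmul(W, x)
    if adapter is not None:
        B, A = adapter
        y = add(y, matmul(B, matmul(A, x)))
    return y


rng = random.Random(0)
d_out, d_in, rank = 6, 6, 2           # toy sizes, assumed for illustration
W = rand_matrix(d_out, d_in, rng)     # frozen "pretrained" grid, never modified
x = rand_matrix(d_in, 1, rng, 1.0)    # one input, as a column vector


def new_adapter():
    """Fresh adapter: A is random, B is all zeros, so B @ A starts at zero."""
    A = rand_matrix(rank, d_in, rng)
    B = [[0.0] * rank for _ in range(d_out)]
    return B, A


# One frozen base, one small adapter per task.
adapters = {"legal": new_adapter(), "marketing": new_adapter()}
adapters["legal"][0][0][0] = 0.5      # stand-in for what training would change in B

base_out = forward(x, W)
legal_out = forward(x, W, adapters["legal"])          # base plus legal "sticky notes"
marketing_out = forward(x, W, adapters["marketing"])  # untrained, so same as base

# Savings at a more realistic (still assumed) layer size:
print(full_params(4096, 4096), lora_params(4096, 4096, rank=8))
```

Switching tasks means passing a different (B, A) pair to forward while W stays untouched, which is the sticky-note swap described above.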

A follow-up called QLoRA pushed this further by combining LoRA with quantization - compressing the frozen base model so each weight takes fewer bits of memory - so you can fine-tune models with tens of billions of weights on a single graphics card. Between them, LoRA and QLoRA are a big reason the open-model community, which we cover in open-weight models, can produce endless specialized variants of a shared base.
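A rough sketch of why quantization helps, under loud assumptions: the 65-billion-weight model is a hypothetical size, the memory figures count only the frozen weights (not the adapter, activations, or optimizer state), and the rounding scheme below is plain evenly spaced 4-bit rounding, which is simpler than the 4-bit format the QLoRA paper actually uses.

```python
def weights_gb(num_weights, bits_per_weight):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_weights * bits_per_weight / 8 / 1e9


def quantize_4bit(values):
    """Store each value as an integer code from -7 to 7 plus one shared scale."""
    scale = max(abs(v) for v in values) / 7 or 1.0  # fall back to 1.0 if all zeros
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate values from the integer codes."""
    return [c * scale for c in codes]


hypothetical_model = 65_000_000_000  # assumed size, for illustration only
print(weights_gb(hypothetical_model, 16), "GB at 16 bits per weight")
print(weights_gb(hypothetical_model, 4), "GB at 4 bits per weight")

weights = [0.12, -0.70, 0.33, 0.04, -0.21]  # made-up weights
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, f"worst rounding error: {worst_error:.3f}")
```

The frozen base only needs to be read, not updated, so it can tolerate this kind of rounding; the small adapter that actually learns stays at higher precision.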

One last distinction trips people up constantly, so let's nail it. Fine-tuning is not the only way to make a model do what you want, and often it is the wrong tool. If you just need the model to know some facts - your company's current pricing, a document, today's data - you usually don't fine-tune at all. You hand that information to the model at question time, either by pasting it into the prompt or through retrieval-augmented generation, which looks things up and feeds them in. The rule of thumb: fine-tuning teaches a skill or a style; retrieval supplies knowledge. Want the model to always respond in legal-brief format, or reliably follow a tricky output structure, or adopt a consistent voice? That is a behavior - fine-tune it. Want it to answer questions about a document that changes every week? That is knowledge - retrieve it, because fine-tuning bakes information in permanently and re-baking every week is absurd. Reaching for fine-tuning when you needed retrieval (or vice versa) is one of the most common and costly mistakes in applied AI. Get that distinction right and you have most of what you need to decide, for any real task, whether to teach the model something new or simply to tell it something.
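For contrast, here is a minimal sketch of the "tell it something" route. Everything in it is hypothetical: the documents, the question, and the word-overlap scoring, which is a crude stand-in for real search. The point is only that the model itself is never touched; the fresh information travels inside the prompt.

```python
import re


def words(text):
    """Lowercase words and numbers in a piece of text, punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score(question, passage):
    """Count the words the question and passage share (toy relevance measure)."""
    return len(words(question) & words(passage))


def build_prompt(question, passages, top_k=1):
    """Pick the most relevant passages and paste them in front of the question."""
    ranked = sorted(passages, key=lambda p: score(question, p), reverse=True)
    context = "\n".join(ranked[:top_k])
    return f"Use only this reference material:\n{context}\n\nQuestion: {question}"


# Made-up documents that could change every week without any retraining.
documents = [
    "Pricing: the Pro plan costs 30 dollars per month.",
    "Support hours: weekdays 9 to 5.",
]

prompt = build_prompt("How much does the Pro plan cost per month?", documents)
print(prompt)
```

When the pricing changes, you edit the document and the next prompt picks it up; nothing about the model needs re-baking.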

Key papers
LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021)
QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023)

Key questions

What is fine-tuning in AI?

Fine-tuning is taking a model that has already learned general skills from huge amounts of data and giving it a smaller, targeted round of training to specialize it for a particular task, tone, or domain. It adapts an existing model instead of building a new one from scratch.

What is LoRA and why is it popular?

LoRA, or low-rank adaptation, is a way to fine-tune a model by training a small set of add-on numbers while leaving the original model frozen, cutting the memory and cost of customization dramatically. It is popular because it lets people fine-tune large models on modest hardware, in some cases a single consumer graphics card, and swap tiny adapters between tasks while keeping one copy of the base model.

Is fine-tuning the same as retrieval or a bigger prompt?

No - fine-tuning changes the model's behavior by adjusting its internal weights, while retrieval and long prompts leave the model unchanged and just feed it extra information at the moment you ask. Fine-tuning teaches a skill; retrieval hands over reference material.

Topics: fine-tuning · LoRA · training · adaptation · efficiency