Learn · Intermediate
Fine-tuning and LoRA: teaching an old model a new job without retraining it
Training a large AI model from nothing is one of the most expensive things humans do with computers - millions of dollars, months of time, oceans of text. Almost nobody does it. Instead, the field runs on a much cheaper idea: take a model that has already been trained on a huge chunk of the internet and learned the general shape of language, then give it a small, focused nudge toward the specific job you care about. That nudge is called fine-tuning, and it is how a general-purpose model becomes a medical-notes summarizer, a customer-service bot that speaks in your brand's voice, or a code assistant tuned to your company's style.
To see why this works, it helps to remember the two-stage life of a model, which we cover in training vs inference. The first stage, pretraining, is the massive, expensive one: the model reads a huge chunk of the internet and learns grammar, facts, reasoning patterns, the works. What comes out is a model with broad competence but no particular focus - a brilliant generalist. Fine-tuning is a second, far smaller training stage layered on top. You show the model a modest set of examples of the exact behavior you want - a few hundred or a few thousand, not billions - and let it adjust so that behavior becomes its default. It is the difference between a medical-school graduate and a trained cardiologist: same foundation, a focused specialization added on top. Crucially, the model keeps most of what it learned in pretraining, although heavy-handed fine-tuning can erode some of it; you are steering it, not rebuilding it.
But classic fine-tuning has a brutal cost problem. A large model's knowledge lives in billions of internal numbers called weights, and traditional "full" fine-tuning means adjusting all of them, then saving a complete new copy of the model for every task - and for the largest models, one copy can run to hundreds of gigabytes. Fine-tune it for legal work and again for marketing and you now store two giant models. That is expensive to compute, expensive to store, and out of reach for anyone without a data center. This is the wall that a technique called LoRA - short for low-rank adaptation - tore down, and it is why fine-tuning went from a big-lab luxury to something a hobbyist can do on a single graphics card.
The LoRA insight, from the 2021 paper that introduced it, is beautifully lazy. Instead of editing the model's billions of weights, you freeze the entire original model - touch nothing - and bolt on a tiny set of new numbers alongside it. During fine-tuning, only those small add-on numbers learn; the giant frozen model just provides its existing knowledge underneath. The add-on is small because of a mathematical shortcut: the change you need to make to a giant grid of weights can be closely approximated by two much skinnier grids multiplied together, so you train those two skinny grids instead of the enormous one. The result is an adapter that is often thousands of times smaller than the full model - small enough to email. Think of the base model as an expensive published textbook you are not allowed to write in, and LoRA as a set of margin sticky notes: the book stays pristine, the notes carry your customization, and you can peel off one set of notes and slap on another to switch the model between tasks instantly. That last part is a real practical win - you keep one copy of the big model and swap tiny adapters for legal, marketing, or support.
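To make the "two skinny grids" idea concrete, here is a minimal sketch in plain Python. It assumes a single weight grid W of shape d_out x d_in and an adapter made of two grids, B (d_out x rank) and A (rank x d_in), whose product is added to W. The helper names, the tiny sizes, and the `scale` parameter are illustrative choices, not taken from any particular library.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_param_counts(d_out, d_in, rank):
    """Return (weights in the full grid, weights in the two skinny grids)."""
    full = d_out * d_in
    adapter = rank * (d_out + d_in)
    return full, adapter

def effective_weight(W, B, A, scale=1.0):
    """Frozen W plus the low-rank update B @ A; W itself is never modified."""
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

random.seed(0)
d_out, d_in, rank = 6, 8, 2  # toy sizes, chosen only for illustration

# The frozen pretrained grid (random numbers stand in for real weights).
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]

# The adapter: B starts at zero and A starts small and random,
# so before any training the adapter changes nothing.
B = [[0.0] * rank for _ in range(d_out)]
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]

assert effective_weight(W, B, A) == W  # untrained adapter leaves the model as-is

full, adapter = lora_param_counts(4096, 4096, 8)
print(full, adapter, full // adapter)  # 16777216 65536 256
```

For a single 4096 x 4096 grid at rank 8, the adapter holds 256 times fewer numbers than the grid it modifies. Swapping tasks amounts to keeping W and substituting a different (B, A) pair.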
A follow-up called QLoRA pushed this further by combining LoRA with quantization - compressing the frozen base model to 4-bit precision so it uses far less memory - so that models with tens of billions of parameters can be fine-tuned on a single graphics card. Between them, LoRA and QLoRA are a big reason the open-model community, which we cover in open-weight models, can produce endless specialized variants of a shared base.
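As a rough illustration of what quantization buys, the sketch below rounds a small block of weights to 4-bit integer codes plus one shared scale, then reconstructs approximate values. This is a deliberately simplified stand-in: QLoRA itself uses a more refined 4-bit format (NormalFloat, NF4), and the function names and sample weights here are made up for the example.

```python
def quantize_block(values, bits=4):
    """Map floats to small signed integer codes plus one shared scale."""
    levels = 2 ** (bits - 1) - 1          # 7 for 4 bits: codes span -7..7
    absmax = max(abs(v) for v in values)
    scale = absmax / levels if absmax > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale

def dequantize_block(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]

def approx_gigabytes(n_params, bits_per_weight):
    """Raw weight storage only; ignores the small overhead of the scales."""
    return n_params * bits_per_weight / 8 / 1e9

weights = [0.7, -0.32, 0.11, 0.0]          # invented sample weights
codes, scale = quantize_block(weights)
print(codes)                                # [7, -3, 1, 0]
restored = dequantize_block(codes, scale)   # close to, not equal to, the originals

# A hypothetical 7-billion-parameter model at 16 bits vs 4 bits per weight:
print(approx_gigabytes(7_000_000_000, 16))  # 14.0
print(approx_gigabytes(7_000_000_000, 4))   # 3.5
```

The restored weights are slightly off (here -0.32 comes back as roughly -0.3). That small loss is the price of shrinking the frozen base model to about a quarter of its 16-bit size.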
One last distinction trips people up constantly, so let's nail it. Fine-tuning is not the only way to make a model do what you want, and often it is the wrong tool. If you just need the model to know some facts - your company's current pricing, a document, today's data - you usually don't fine-tune at all. You hand that information to the model at question time, either by pasting it into the prompt or through retrieval-augmented generation, which looks things up and feeds them in. The rule of thumb: fine-tuning teaches a skill or a style; retrieval supplies knowledge. Want the model to always respond in legal-brief format, or reliably follow a tricky output structure, or adopt a consistent voice? That is a behavior - fine-tune it. Want it to answer questions about a document that changes every week? That is knowledge - retrieve it, because fine-tuning bakes information in permanently and re-baking every week is absurd. Reaching for fine-tuning when you needed retrieval (or vice versa) is one of the most common and costly mistakes in applied AI. Get that distinction right and you have most of what you need to decide, for any real task, whether to teach the model something new or simply to tell it something.
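The "tell it something" path can be sketched in a few lines. The toy below picks the document that shares the most words with the question and pastes it into the prompt. Real retrieval systems typically use far better matching than word overlap, and the function names, prompt wording, and sample documents are all invented for illustration.

```python
def retrieve(question, documents, top_k=1):
    """Rank documents by how many words they share with the question (toy scoring)."""
    q_words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

def build_prompt(question, documents):
    """Paste the best-matching text into the prompt ahead of the question."""
    context = "\n".join(retrieve(question, documents))
    return (
        "Use only the context below to answer.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Made-up documents standing in for information that changes every week.
docs = [
    "The Pro plan costs 40 dollars per month as of this week.",
    "Support is available on weekdays from 9 to 5.",
]
prompt = build_prompt("How much does the Pro plan cost?", docs)
print(prompt)
```

When the pricing changes next week, you edit the document and nothing else; the model is never retrained.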
LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021)
QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023)