Retrieval-Augmented Generation: giving a model an open book

A plain language model is a closed-book exam taker. It answers from memory -- everything it absorbed during training, frozen at some cutoff date -- and it has no way to check a fact, look up your company's internal docs, or know what happened yesterday. Worse, when it does not know, it does not fall silent; it makes something up that sounds right. Retrieval-Augmented Generation, or RAG, is the now-standard fix, and the idea is exactly what it sounds like: turn the closed-book exam into an open-book one. Before the model writes a word, it goes and fetches relevant material, and then it answers from what is in front of it.

Here is the flow, concretely. Suppose you want a model to answer questions about your company's HR policies. First, offline, you take all those policy documents and chop them into bite-sized chunks. You run each chunk through an embedding model, which turns it into a vector -- a point in space where nearby points have similar meanings -- and you store all those vectors in a database. That is the indexing step, done ahead of time and repeated only when the documents change. Then, when a user asks "How many vacation days do I get after five years?", you embed the question into a vector too, and you search the database for the chunks whose vectors sit closest to it. Those are the passages the embedding model judges most semantically related to the question -- even if they never use the words "vacation days". Finally, you paste those retrieved chunks into the model's prompt along with the question, and instruct it: answer using this material. The model reads the actual policy and writes the answer from it.
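The three steps above (index, retrieve, build the prompt) can be sketched in a few lines of Python. This is a minimal illustration under loud assumptions, not a production design: toy_embed is a hypothetical stand-in that just counts words, so it only captures word overlap, whereas a real embedding model captures meaning; the "database" is a plain list; and the policy sentences are made-up example data.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a sparse word-count vector.
    It captures word overlap only; a real embedding model captures meaning."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks):
    """Indexing step: embed every chunk ahead of time and store the pairs."""
    return [(chunk, toy_embed(chunk)) for chunk in chunks]


def retrieve(index, question, k=2):
    """Query step: embed the question and return the k closest chunks."""
    q = toy_embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question, passages):
    """Paste the retrieved passages into the prompt ahead of the question."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using only the material below.\n\n"
        f"{context}\n\nQuestion: {question}"
    )


# Made-up policy snippets, for illustration only.
chunks = [
    "Employees receive 20 vacation days per year after five years of service.",
    "Expense reports must be submitted within 30 days of purchase.",
    "The office is closed on public holidays.",
]
index = build_index(chunks)
question = "How many vacation days do I get after five years?"
top = retrieve(index, question, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

In a real system the list would typically be replaced by a vector database and toy_embed by a trained embedding model; the shape of the pipeline stays the same.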

The payoff is three-fold, and it is why RAG is everywhere. First, freshness: you can update the document store any time without retraining the model, so the system knows about today's information, not just what existed at its training cutoff. Second, private knowledge: the model can answer about your internal documents, which were never in its training data. Third, and most important, grounding and citations: because the answer is built from specific retrieved passages, the system can show its sources, and the model is far less likely to invent facts when the real ones are sitting in its context. RAG does not cure hallucination, but it typically reduces it by changing the task from recall-from-memory to read-and-summarize. This is close to the case made in the original RAG paper by Lewis and colleagues, which paired a neural retriever with a sequence-to-sequence generator and fine-tuned them to work together; REALM made a similar case for baking retrieval into pre-training itself.

The quiet hero of the whole scheme is the retriever, and getting it right is where RAG lives or dies. Early search matched keywords; the leap that made modern RAG work was dense retrieval -- matching on meaning via embeddings, so a question about time off finds a passage about paid leave even with zero shared words. Dense Passage Retrieval showed this approach beating BM25, a strong classic keyword baseline, on most of the open-domain question answering benchmarks it tested, and it is the template many systems still follow. The analogy is the difference between a librarian who only finds books with your exact title words and one who understands what you actually mean and walks you to the right shelf.
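The time-off example can be made concrete with a small comparison. The sketch below scores the same query two ways: by counting shared words, and by cosine similarity between vectors. The three-dimensional vectors are hand-written, made-up numbers standing in for what a trained embedding model might produce; they are there only to show the mechanics, not real model output.

```python
import math


def keyword_overlap(query, passage):
    """Keyword-style score: number of distinct words the two texts share."""
    return len(set(query.lower().split()) & set(passage.lower().split()))


def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query = "time off"
query_vec = [0.8, 0.2, 0.1]  # made-up embedding of the query

# Passage text -> made-up embedding. A real system would compute these
# with an embedding model instead of writing them by hand.
passages = {
    "paid leave policy": [0.9, 0.1, 0.0],
    "expense report deadline": [0.1, 0.9, 0.2],
}

keyword_scores = {text: keyword_overlap(query, text) for text in passages}
dense_scores = {text: cosine(query_vec, vec) for text, vec in passages.items()}
best_dense = max(dense_scores, key=dense_scores.get)

print(keyword_scores)  # every passage scores 0: no shared words
print(best_dense)      # the passage whose vector points the same way as the query
```

Keyword scoring cannot tell the two passages apart here, because neither shares a word with the query; the vector comparison can, provided the embedding model has placed "time off" and "paid leave" near each other.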

The honest caveats matter, because RAG is powerful but not magic. It is only as good as what it retrieves: if the right passage is not in your store, or the retriever fetches the wrong chunks, the model answers confidently from irrelevant material -- garbage in, fluent garbage out. Chunking is fiddly -- cut documents too small and you lose context, too big and you bury the relevant sentence in noise. Retrieved text eats into the model's context window, so there is a real limit on how much you can stuff in. And a subtle failure mode: the model can still ignore or contradict the very passages you handed it, which is why good RAG systems check that the answer is actually supported by the sources. None of this dims the core idea, though. RAG is the bridge between a frozen, general-purpose brain and the specific, current, private knowledge a real application needs -- and it is the backbone of many of the document question-answering, search, and assistant products shipping today.
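Two of these caveats, chunk size and the context-window limit, come down to a few tunable numbers. The sketch below shows one common way to handle them, under stated assumptions: chunk_words and pack_within_budget are hypothetical helper names, overlapping windows are a widely used mitigation rather than something this article prescribes, and word counts stand in for the token counts a real system would measure with the model's tokenizer.

```python
def chunk_words(text, size=50, overlap=10):
    """Split text into windows of `size` words, each sharing `overlap`
    words with the previous window so sentences cut at a boundary
    still appear whole in at least one chunk."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


def pack_within_budget(ranked_chunks, max_words):
    """Keep the best-ranked chunks, in order, until the word budget
    (a rough proxy for context-window tokens) would be exceeded."""
    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = len(chunk.split())
        if used + cost > max_words:
            break
        selected.append(chunk)
        used += cost
    return selected


# Twelve placeholder words: w0 w1 ... w11
text = " ".join(f"w{i}" for i in range(12))
pieces = chunk_words(text, size=5, overlap=2)
kept = pack_within_budget(pieces, max_words=12)
print(pieces)
print(kept)
```

Raising size keeps more context per chunk but dilutes the relevant sentence; lowering it does the opposite. The budget function makes the trade-off explicit: the more words each chunk costs, the fewer chunks fit.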

Key papers
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al., 2020)
REALM: Retrieval-Augmented Language Model Pre-Training (Guu et al., 2020)
Dense Passage Retrieval for Open-Domain Question Answering (Karpukhin et al., 2020)