
Reward-based fine-tuning (RLHF and RLVR)

A large language model starts life doing one narrow thing: predicting the next word, practiced over a staggering amount of text. That makes it fluent and knowledgeable, but completely unfocused: it will happily continue a sentence with no sense of whether it's being helpful, honest, or safe. Almost everything that makes a modern model feel like a useful assistant rather than a fancy autocomplete comes from a second phase, where the model is polished by rewarding good behavior, the same basic idea as training a dog with treats.

RLHF: learning from human preferences

The classic recipe is reinforcement learning from human feedback, or RLHF. The seed idea came from Deep Reinforcement Learning from Human Preferences (Christiano and colleagues, 2017): instead of trying to write down a precise reward — hopeless for something as fuzzy as "a good answer" — you show a system two options and simply let a human say which they prefer. Collect enough of those comparisons and you can train a separate "reward model" that scores answers the way people tend to.
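
To make the reward-model idea concrete, here is a minimal sketch in Python. It assumes each answer has already been boiled down to three hand-picked numeric features (answers the question, is polite, rambles), which is purely hypothetical; real reward models are neural networks that read the text itself. The training rule is the standard pairwise-preference (Bradley-Terry style) loss: raise the score of the preferred answer relative to the rejected one.

```python
import math

def sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def score(weights, features):
    # A linear "reward model": weighted sum of the answer's features.
    return sum(w * f for w, f in zip(weights, features))

def train_reward_model(comparisons, n_features, lr=0.1, epochs=200):
    """comparisons: list of (preferred_features, rejected_features) pairs."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for preferred, rejected in comparisons:
            margin = score(weights, preferred) - score(weights, rejected)
            # Loss is -log(sigmoid(margin)). Gradient descent on it moves each
            # weight by lr * (1 - sigmoid(margin)) * (preferred - rejected).
            step = lr * (1.0 - sigmoid(margin))
            for i in range(n_features):
                weights[i] += step * (preferred[i] - rejected[i])
    return weights

# Hypothetical features: [answers_the_question, is_polite, rambles]
comparisons = [
    ([1, 1, 0], [0, 1, 0]),  # humans preferred the answer that addressed the question
    ([1, 0, 0], [1, 0, 1]),  # ...and the one that did not ramble
    ([1, 1, 0], [1, 1, 1]),
    ([1, 0, 0], [0, 1, 1]),
]
weights = train_reward_model(comparisons, n_features=3)
good = score(weights, [1, 1, 0])
bad = score(weights, [0, 1, 1])
print(good > bad)
```

Nobody ever told the model "rambling is bad"; it picked that up from which answer won each comparison, which is the whole point of learning from preferences.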

InstructGPT, the work directly behind ChatGPT, put this together at scale. Starting from a raw text-predictor, the team first fine-tuned it on human-written example answers, then had people rank its outputs from best to worst, trained a reward model on those rankings, and used reinforcement learning to nudge the model toward answers the reward model scored highly. That process is most of what turned an aimless autocomplete into something that follows instructions and feels genuinely helpful. The underlying model barely got "smarter" in raw knowledge; it got aimed.
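
The "nudging" step can be sketched with a toy policy that chooses among three canned answers. Everything here is an illustrative assumption: the three answers, their reward scores, and the learning rate are made up, and the update uses the exact gradient of expected reward rather than sampled text. A production system samples from a large neural network and typically also penalizes drifting too far from the starting model.

```python
import math

def softmax(logits):
    # Turn raw preference scores (logits) into probabilities.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_step(logits, rewards, lr=0.5):
    probs = softmax(logits)
    # Expected reward under the current policy, used as the baseline.
    baseline = sum(p * r for p, r in zip(probs, rewards))
    # Exact gradient of expected reward w.r.t. logit i is p_i * (r_i - baseline):
    # answers scoring above average get pushed up, the rest get pushed down.
    return [x + lr * p * (r - baseline) for x, p, r in zip(logits, probs, rewards)]

logits = [0.0, 0.0, 0.0]      # the policy starts with no preference
rewards = [0.1, 0.9, 0.4]     # hypothetical reward-model scores per answer
for _ in range(50):
    logits = policy_gradient_step(logits, rewards)

final_probs = softmax(logits)
print([round(p, 3) for p in final_probs])
```

After the loop, most of the probability sits on the answer with the highest reward score. The model's "knowledge" (the set of answers it can produce) never changed; only its preferences did.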

RLVR: rewarding verifiable correctness

For tasks with a checkable right answer (math, code, logic puzzles) you don't even need humans in the loop. You can reward the model whenever its answer passes an automatic check: the final number matches, the code passes its tests, the proof verifies. This is reinforcement learning with verifiable rewards, or RLVR, and it's the engine behind the recent wave of strong "reasoning" models. DeepSeek-R1 was a landmark, showing that letting a model practice against automatically verified rewards could teach it to reason (to work step by step, backtrack, and check itself) largely on its own, with far less hand-holding than people expected.
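
A verifiable reward is just a function that checks the answer and returns a score. The sketch below shows two hypothetical checkers, `math_reward` and `code_reward`; the names and scoring rules are assumptions for illustration, not any particular system's implementation. In particular, the code checker here takes an already-loaded Python function, whereas a real pipeline would run model-written code inside a sandbox.

```python
from fractions import Fraction

def math_reward(model_answer: str, reference: str) -> float:
    """Return 1.0 if the model's final answer equals the reference value, else 0.0."""
    try:
        # Fraction parses "0.5", "1/2", and "3" alike, so equivalent forms match.
        same = Fraction(model_answer.strip()) == Fraction(reference.strip())
    except (ValueError, ZeroDivisionError):
        return 0.0  # an unparseable answer earns no reward
    return 1.0 if same else 0.0

def code_reward(candidate, test_cases) -> float:
    """Return the fraction of (args, expected) test cases the candidate passes."""
    passed = 0
    for args, expected in test_cases:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash counts as a failed test
    return passed / len(test_cases)

tests = [((1, 2), 3), ((0, 0), 0)]
print(math_reward("0.5", "1/2"))               # 1.0
print(math_reward("banana", "1/2"))            # 0.0
print(code_reward(lambda a, b: a + b, tests))  # 1.0
print(code_reward(lambda a, b: a * b, tests))  # 0.5
```

No human judgment enters anywhere: the answer key does the grading, which is why this style of reward scales so easily on tasks where such a key exists.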

A concrete picture

Imagine teaching a student. RLHF is like having an experienced tutor read their essays and say "this one's better than that one," again and again, until the student internalizes good taste. RLVR is like handing them a math workbook with an answer key: they try, check against the key, and adjust — no tutor required, as long as the answers are checkable. Modern models get both kinds of polish, applied to different skills.

The failure mode: it gets boring

Push the reward too hard and something breaks: the model stops exploring and collapses onto one rigid, overconfident style. Researchers call the technical version entropy collapse. We covered a sharp recent example: aggressive reward training quietly starves out the rare pivot words — "but," "wait," "instead" — that let a model second-guess itself, and gently protecting those words keeps it improving far longer.
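
Entropy is simply a measure of how spread out the model's next-word probabilities are. The sketch below uses a made-up four-word distribution and mimics over-training by raising each probability to a power and renormalizing. That sharpening step is only a stand-in chosen for illustration, not a model of real training dynamics, but it shows the symptom: entropy falls and the rare pivot words all but vanish.

```python
import math

def entropy(probs):
    # Shannon entropy in nats: higher means the model is still considering options.
    return -sum(p * math.log(p) for p in probs if p > 0)

def sharpen(probs, power):
    # Illustrative stand-in for over-optimization: exaggerate the current favorite.
    raised = [p ** power for p in probs]
    total = sum(raised)
    return [r / total for r in raised]

# Hypothetical next-word probabilities at a point where the model could change course.
words = ["so", "but", "wait", "instead"]
before = [0.70, 0.15, 0.10, 0.05]
after = sharpen(before, power=4)

print(round(entropy(before), 3), round(entropy(after), 3))
for word, p_before, p_after in zip(words, before, after):
    print(f"{word:8s} {p_before:.3f} -> {p_after:.5f}")
```

Before sharpening, "wait" is picked about one time in ten; afterwards it is picked less than one time in a thousand, so the model almost never gets the chance to second-guess itself.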

It's a reminder that this phase is powerful but delicate: reward shapes behavior strongly, and over-shaping it can train away the very hesitation that made the model good at thinking in the first place. A whole strand of current research is about running this phase more carefully and cheaply — for instance, handing out credit for the right steps without training a second model to judge them.

The takeaway

Most of what makes a model feel smart, helpful, and well-behaved happens here, in the reward phase — not in the original next-word training. It's where a text-predictor becomes an assistant, and where a model learns to reason. But it's a balancing act: reward is a blunt, powerful tool, and pushing it too hard trades away diversity and self-doubt for brittle, confident wrongness.

Key papers
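Deep Reinforcement Learning from Human Preferences (Christiano et al., 2017)
Training Language Models to Follow Instructions with Human Feedback / InstructGPT (Ouyang et al., 2022)
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (DeepSeek-AI, 2025)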