rl-post-training
Chain-of-thought: why making an AI think out loud makes it smarter Lesson
Asking a model to work through a problem step by step, instead of blurting out an answer, dramatically improves its performance on hard tasks. Here is why that simple trick works, what it really buys the model, and where it backfires.
Teaching AI with rewards — minus the expensive second model that grades it News
The standard way to polish a model with rewards quietly runs a second 'critic' model alongside it. A new method derives the critic's judgment from the model itself, dropping the extra cost.
Polishing AI by looking inside its 'mind' instead of just thumbs-up, thumbs-down News
Reward training usually treats the model as a black box — thumbs up, thumbs down, hope for the best. A new method peers inside to see why an answer was preferred, and shapes the lesson on purpose.
The little words that keep AI from getting boring News
Rewarding a reasoning model too hard makes it repetitive — and the casualties are tiny words like "but" and "instead" that let it branch to a better thought. A near-free fix protects them.
Reward-based fine-tuning (RLHF and RLVR) Lesson
After a model is first trained, it gets "polished" by rewarding good answers. Here's what that phase is, why it works, and the failure mode where models get repetitive and dull.
Faster AI training by quietly cloning the model News
Teaching a model with rewards is slow because the model has to write out endless practice answers. A new trick: make a cheap, shrunk-down copy of the model to crank those out faster.
Crediting an AI for the right steps — without a second model to judge them News
When you reward an AI for a good final answer, it's hard to know which of its steps earned the credit. The usual fix is training a second 'judge' model. This approach does without the judge.