News · 2026-07-04

Three Popular Ways to Train Reasoning AIs Turn Out to Be One Formula

A new analysis shows that three of the most widely used recipes for training AI reasoning models with reinforcement learning are, underneath their different names, the same idea expressed three ways. GRPO, Dr. GRPO, and DAPO all boil down to different operations performed on one number: the spread, or standard deviation, of the rewards within a group of sampled answers to the same prompt. Researchers Yong Yi Bay and Kathleen A. Yearick at the University of Illinois Urbana-Champaign prove the connection exactly, then show it has a practical consequence: with a typical group size of 8 sampled answers per prompt, about 44% of prompts on a large math training set teach the model nothing at all.

Key facts

GRPO, Dr. GRPO, and DAPO - three popular recipes for RL training of reasoning models - are shown to be three small operations on one shared quantity: the reward standard deviation within a sampled group of answers.
With a typical group size of 8 sampled answers per prompt, about 44% of prompts on a large math training set are "silent," meaning they produce zero learning signal.
The paper is by Yong Yi Bay and Kathleen A. Yearick, University of Illinois Urbana-Champaign, posted 30 June 2026 (arXiv 2607.00152).
Code and data accompanying the proof are public on GitHub.

Reinforcement learning, in this context, is a training method where a model tries out different answers to a problem, gets graded on whether each one is right or wrong, and is nudged to make its future answers look more like the graded-right ones and less like the graded-wrong ones. To train today's reasoning models - the ones that write out step-by-step solutions to math or coding problems - researchers typically have the model generate a small group of candidate answers per prompt (often 8), grade each one, and use the pattern of right versus wrong within that group to decide how hard to push the model, and in which direction.

The paper's core result is a clean piece of accounting: for any given prompt, the size of the update the model receives is exactly equal to that prompt's reward spread multiplied by the gap between the average score of the right answers and the average score of the wrong ones. Reward spread here just means how much the grades vary within the sampled group: widely scattered, or tightly clustered. Once you see training this way, the three popular recipes turn out to be different choices about what to do with that spread. GRPO divides by it, which has the effect of pushing extra-strong updates onto the hardest and the easiest prompts (where nearly every answer gets the same grade, so the spread is small and dividing by it inflates the update). Dr. GRPO leaves the spread out of the calculation entirely. DAPO takes a more surgical approach: it filters out the prompts that are "silent" before they're used at all.
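The accounting can be checked numerically on a toy group. The sketch below is not the authors' code: it assumes right-or-wrong (0/1) grades, the population standard deviation, and it reads "score" as a per-answer gradient vector, stood in for here by random numbers. The function names `reward_std`, `group_update`, and `spread_times_gap` are made up for illustration. Under those assumptions, dividing the centered reward by the spread gives an update equal to the spread times the right-minus-wrong gap, and leaving the division out gives the spread squared times the same gap; the paper's exact normalization may differ.

```python
import math
import random


def reward_std(rewards):
    """Population standard deviation of the grades in one sampled group."""
    mean_r = sum(rewards) / len(rewards)
    return math.sqrt(sum((r - mean_r) ** 2 for r in rewards) / len(rewards))


def group_update(rewards, grads, divide_by_std):
    """Average of per-answer gradient vectors, weighted by centered reward.

    With divide_by_std=True the centered reward is also divided by the
    group's reward spread (GRPO-style); with False it is left as-is.
    """
    n, dim = len(rewards), len(grads[0])
    mean_r, std_r = sum(rewards) / n, reward_std(rewards)
    if std_r == 0.0:
        return [0.0] * dim  # silent group: no contrast, no update
    scale = std_r if divide_by_std else 1.0
    return [
        sum((r - mean_r) / scale * g[d] for r, g in zip(rewards, grads)) / n
        for d in range(dim)
    ]


def spread_times_gap(rewards, grads, power):
    """std**power times (mean gradient of right answers - mean of wrong ones)."""
    dim = len(grads[0])
    right = [g for r, g in zip(rewards, grads) if r == 1]
    wrong = [g for r, g in zip(rewards, grads) if r == 0]
    if not right or not wrong:
        return [0.0] * dim  # no gap to measure when one side is empty
    factor = reward_std(rewards) ** power
    return [
        factor * (sum(g[d] for g in right) / len(right)
                  - sum(g[d] for g in wrong) / len(wrong))
        for d in range(dim)
    ]


random.seed(0)
rewards = [1, 0, 0, 1, 0, 0, 0, 1]  # 3 right, 5 wrong, group size 8
# Hypothetical stand-ins for real per-answer gradients.
grads = [[random.gauss(0, 1) for _ in range(4)] for _ in rewards]

normalized = group_update(rewards, grads, divide_by_std=True)
unnormalized = group_update(rewards, grads, divide_by_std=False)

# Dividing by the spread leaves one factor of it; not dividing leaves two.
print(all(math.isclose(a, b, abs_tol=1e-9)
          for a, b in zip(normalized, spread_times_gap(rewards, grads, 1))))
print(all(math.isclose(a, b, abs_tol=1e-9)
          for a, b in zip(unnormalized, spread_times_gap(rewards, grads, 2))))
```

Both checks should print `True` for any mixed group, since the identity is algebraic and does not depend on the random gradients chosen.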

A silent prompt is one where every sampled answer in the group gets the same grade - all right, or all wrong. If the model already always gets a problem right, or always gets it wrong, there's no contrast within the group to learn from, and the reward spread for that prompt is exactly zero. It's a bit like assigning a class of students two versions of a test - one so easy everyone gets 100%, one so hard everyone gets 0% - and expecting to learn who understands the material best from either one. Neither test produces any useful signal, because there's no variation to compare against.
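Detecting a silent prompt takes only a few lines. The following is a minimal sketch, assuming each prompt's grades are stored as a list of 0/1 values; the names `is_silent` and `drop_silent` and the example batch are hypothetical, and the filter is written in the spirit of what this article attributes to DAPO rather than taken from any particular implementation.

```python
def is_silent(rewards):
    """True when every sampled answer in the group received the same grade."""
    return len(set(rewards)) <= 1


def drop_silent(batch):
    """Keep only prompts whose group shows some contrast between grades.

    batch maps a prompt id to the list of grades for its sampled answers.
    """
    return {pid: rewards for pid, rewards in batch.items() if not is_silent(rewards)}


# Hypothetical batch: one always-right prompt, one always-wrong, one mixed.
batch = {
    "too_easy": [1, 1, 1, 1, 1, 1, 1, 1],
    "too_hard": [0, 0, 0, 0, 0, 0, 0, 0],
    "useful": [1, 0, 0, 1, 0, 1, 1, 0],
}

kept = drop_silent(batch)
print(sorted(kept))  # ['useful']
```

Checking for a single distinct grade is equivalent to checking that the reward spread is zero, and it avoids comparing floating-point numbers.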

The practical sting is in how common this is. On a large math training set, using the standard group size of 8 samples per prompt, the researchers found that about 44% of prompts are silent. Nearly half the training data effectively teaches the model nothing on a given pass unless the group size is increased or the prompts are chosen more carefully to sit in the range where the model sometimes gets it right and sometimes doesn't.
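A back-of-the-envelope calculation shows why group size and prompt difficulty both matter. The sketch below assumes each sampled answer is independently right with the same probability (a simplification, and the name `silent_probability` is made up for illustration). Under that assumption a group is silent when all answers are right or all are wrong. It does not attempt to reproduce the paper's 44% figure, which depends on the real spread of difficulty across the dataset.

```python
def silent_probability(pass_rate, group_size):
    """Chance that a group is all-right or all-wrong.

    Assumes each answer is independently right with probability pass_rate.
    """
    return pass_rate ** group_size + (1.0 - pass_rate) ** group_size


# A prompt the model gets right half the time is almost never silent.
print(silent_probability(0.5, 8))  # 0.0078125

# Prompts the model nearly always solves, or nearly always fails, are
# silent far more often; a larger group reduces the chance.
for pass_rate in (0.02, 0.5, 0.9, 0.98):
    for group_size in (8, 16, 32):
        p_silent = silent_probability(pass_rate, group_size)
        print(f"pass rate {pass_rate:.2f}, group size {group_size:2d}: {p_silent:.3f}")
```

The two levers named above show up directly in the formula: raising `group_size` shrinks both terms, and moving `pass_rate` toward the middle shrinks whichever term dominates.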

This is a tidy piece of theory - the authors call it "GRPO, Dr. GRPO, and DAPO Are Three Operations on One Number" - but it comes with a real limit. It's an exact accounting result, verified against one large math training dataset, and it explains precisely what these three recipes are doing mathematically. It doesn't settle which recipe trains the best model in practice, since real training outcomes still hinge on many other choices: how prompts are selected, how large the sampled groups are, and other hyperparameters the paper doesn't touch.

For background on how this style of training works more broadly, see our explainer on /learn/rl-post-training.html, and for how it differs from methods that reuse older data instead of fresh samples, see /learn/on-policy-vs-off-policy-learning.html.

Primary source, verified: read the paper → (arXiv 2607.00152)

Key questions

What do GRPO, Dr. GRPO, and DAPO actually have in common?

All three turn out to be simple operations on the same underlying number, the standard deviation of rewards within a group of sampled answers to a prompt.

What is a silent prompt?

It is a training prompt where every sampled answer gets the same right-or-wrong grade, so the reward spread is zero and the model learns nothing from it.

Does this mean one of the three recipes is best?

Not exactly - the paper shows what each recipe mathematically does with the reward spread, but which is best in practice still depends on other training choices.

Cite this

APA

Ground Truth. (2026, July 4). Three Popular Ways to Train Reasoning AIs Turn Out to Be One Formula. Ground Truth. https://groundtruth.day/news/three-rl-recipes-are-really-one-number.html

BibTeX

@misc{groundtruth:three-rl-recipes-are-really-one-number,
  title  = {Three Popular Ways to Train Reasoning AIs Turn Out to Be One Formula},
  author = {{Ground Truth}},
  year   = {2026},
  month  = {jul},
  url    = {https://groundtruth.day/news/three-rl-recipes-are-really-one-number.html}
}

Topics: reinforcement learning · GRPO · training · reasoning · theory
