News · 2026-07-04

Why Asking an AI the Same Question 10,000 Times Barely Helps

Asking an AI model the same question thousands of times and picking the most common answer runs into two hard ceilings, and past a certain point it can make the model's answers worse, not better, according to a new analysis. Researchers Yong Yi Bay and Kathleen A. Yearick of the University of Illinois Urbana-Champaign, the team behind a companion paper on reinforcement learning recipes, show that ten thousand samples on a real task can carry only as much information as roughly two independent ones.

Key facts

Majority-vote accuracy from repeated sampling plateaus, and can decline, once sample counts get large, because the samples are correlated rather than independent.
Measured on public evaluation logs, ten thousand samples can be worth only about two truly independent samples for estimating a score.
"Coverage" - whether at least one sample among many is correct - keeps improving with more samples, even as majority-vote accuracy stalls.
By Yong Yi Bay and Kathleen A. Yearick, University of Illinois Urbana-Champaign, posted 27 June 2026 (arXiv 2606.28661); code and data are public on GitHub.

A common trick for squeezing more accuracy out of an AI model without retraining it is repeated sampling: instead of asking the model once, you have it attempt the same problem many times. Then you either take the most common answer as the final one (majority voting) or use some outside checker to pick the best attempt from the pile (best-of-N). The intuition is straightforward - more tries should mean a better shot at the right answer, the same way flipping a coin more times gives you a more reliable read on whether it's fair.
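The two selection rules described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the function names, the sample answers, and the stand-in checker are all hypothetical.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def pick_with_verifier(
    answers: Sequence[str], is_correct: Callable[[str], bool]
) -> Optional[str]:
    """Return the first answer the outside checker accepts, or None."""
    for answer in answers:
        if is_correct(answer):
            return answer
    return None


# Hypothetical pile of six attempts at one problem whose true answer is "17".
samples = ["12", "15", "12", "12", "15", "17"]

print(majority_vote(samples))                            # 12
print(pick_with_verifier(samples, lambda a: a == "17"))  # 17
```

In this made-up pile the correct answer appears only once, so the vote returns a wrong answer while the checker still finds the right one.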

The paper's central finding is that this intuition breaks down because the many attempts an AI model makes at one problem are not independent draws the way separate coin flips are. They are all shaped by the same underlying model, working on the same problem, so they tend to share the same blind spots and the same mistakes. The researchers compare it to polling ten people from the same household instead of ten strangers off the street: you can ask all ten, but you are not really getting ten independent opinions, because household members tend to already agree with each other. Sampling from a language model works the same way - additional samples from the same model on the same prompt tend to repeat the model's existing biases rather than exploring genuinely new territory.

Measured against real evaluation logs, that correlation turns out to be strong. Ten thousand samples on a task can be worth, in terms of genuinely new information, only about two independent samples. In practice, majority-vote accuracy plateaus well before the raw sample count would suggest. Worse, when the single most common answer among the samples is wrong, throwing more samples at the problem does not fix it - it just makes the vote more confident in that same wrong answer, since more attempts pile onto the same mistaken pattern. The paper's title captures the finding directly: "When More Sampling Hurts."
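One way to see how ten thousand samples could shrink to about two is the textbook design-effect formula for equally correlated samples, n_eff = n / (1 + (n - 1) * rho). The sketch below assumes that formula and an illustrative correlation of rho = 0.5; the article does not say which estimator the paper uses, so treat both the formula and the value of rho as assumptions.

```python
def effective_samples(n: int, rho: float) -> float:
    """Effective sample size for n samples with pairwise correlation rho.

    Textbook design-effect formula, assumed here for illustration;
    it is not confirmed to be the paper's own estimator.
    """
    return n / (1.0 + (n - 1) * rho)


# rho = 0.5 is a hypothetical value chosen for illustration.
for n in (10, 100, 10_000):
    print(n, round(effective_samples(n, 0.5), 3))
```

Under this formula the effective count can never exceed 1 / rho, however large n gets, which is why adding samples stops adding information once the correlation is substantial.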

There is, however, a genuinely useful distinction buried in the same data. "Coverage" - whether at least one of the many sampled answers is correct, even if it's not the most common one - keeps climbing as you add more samples, well past the point where majority voting has stalled. That means the bottleneck isn't generating a correct answer; a correct answer is very often already sitting somewhere in the pile of attempts. The bottleneck is recognizing which sample is the correct one. If you have a reliable outside way to check an answer - a verifier, such as running code to see if it actually works, or checking a math answer against a known solution - then generating more samples keeps paying off: the extra samples raise the odds that a correct answer appears somewhere among them, and the verifier, not a popularity contest, does the identifying. The authors suggest that researchers stop reporting raw sample counts as if they represented that much independent evidence, and instead report an "effective number of samples" that accounts for this correlation.
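The gap between coverage and majority voting can be illustrated with a toy calculation. The sketch below is not the paper's method: it assumes a single question with only two possible answers, where each sample is independently correct with a hypothetical probability p = 0.4, so the wrong answer is the more common one.

```python
from math import comb


def coverage(p: float, n: int) -> float:
    """Chance that at least one of n samples is correct."""
    return 1.0 - (1.0 - p) ** n


def majority_correct(p: float, n: int) -> float:
    """Chance the correct answer wins a strict majority of n samples.

    Toy two-answer model; use odd n so the vote cannot tie.
    """
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


p = 0.4  # hypothetical per-sample chance of the correct answer
for n in (1, 11, 101):
    print(n, round(coverage(p, n), 4), round(majority_correct(p, n), 4))
```

In this toy setting coverage rises toward 1 as n grows, while the chance that the vote lands on the correct answer falls below the single-sample rate, matching the case where the most common answer is wrong.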

The main caveat is that the finding applies specifically to the estimate-and-vote setting, where the AI's own aggregate output is the final answer. It does not apply in the same way when a trustworthy verifier is available to sort the good answers from the bad ones. For background on how a model decides what to say next in the first place, see /learn/how-ai-picks-its-next-word.html, and for the companion finding on how training recipes handle a related kind of statistical noise, see our story at /news/three-rl-recipes-are-really-one-number.html.

Primary source, verified: arXiv 2606.28661

Key questions

Does sampling more answers from an AI always make it more accurate?

No - for picking an answer by majority vote, accuracy plateaus and can even get worse with more samples, because the samples are correlated rather than independent.

Why doesn't more sampling keep helping?

Because repeated samples on the same problem are like polling several people from the same household rather than strangers, so extra samples add much less new information than raw counting suggests.

Is there any way more sampling still helps?

Yes - if there is a verifier, meaning a reliable way to check which sampled answer is actually correct, then generating more samples keeps paying off, because more tries raise the chance at least one is right.

Cite this

APA

Ground Truth. (2026, July 4). Why Asking an AI the Same Question 10,000 Times Barely Helps. Ground Truth. https://groundtruth.day/news/why-more-samples-stop-helping-your-ai.html

BibTeX

@misc{groundtruth:why-more-samples-stop-helping-your-ai,
  title  = {Why Asking an AI the Same Question 10,000 Times Barely Helps},
  author = {{Ground Truth}},
  year   = {2026},
  month  = {jul},
  url    = {https://groundtruth.day/news/why-more-samples-stop-helping-your-ai.html}
}

Topics: reasoning · test-time compute · sampling · evaluation · theory
