News · 2026-06-19

Your AI judge might be reliable — and still be wrong

Over the past two years, one of the main tools for measuring AI quality has been the "language model judge": a second model that grades another model's outputs and decides which of two responses is better. These judges power everything from RLHF, the training technique that makes models helpful, to research leaderboards and automated test suites. If the judges are unreliable or biased, everything built on top of them rests on a shaky foundation.

A new paper (arXiv:2606.19544) is the largest systematic audit of language model judges to date. It covers twenty-one judges from nine providers, including the most capable AI systems available as of spring 2026, across three popular judge benchmarks and more than half a million individual grading decisions. The core thesis is stated directly in the title: judges can be reliable (they give consistent answers) without being valid (their answers are correct). These are different properties, and the field has been systematically conflating them.

The most consequential finding involves a basic statistical correction that is almost never applied. When you measure how often a judge agrees with human labels, you get a number that looks impressive: say, agreement on eighty or eighty-five percent of comparisons. But that number doesn't account for how often the judge would agree purely by chance. On a benchmark with three roughly equal categories, random guessing would match the human label about a third of the time. A standard correction called Cohen's kappa removes this "chance floor" by rescaling agreement so that zero means chance-level and one means perfect agreement. Applied to the most widely used judge benchmark, it deflates the apparent reliability of judges by an average of about thirty-eight percentage points. That is not a rounding error; it changes the conclusion. Judges that looked "excellent" by raw agreement turn out to be merely "moderate" once chance is accounted for.
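To make the chance correction concrete, here is a minimal Python sketch of Cohen's kappa, computed as (observed agreement minus expected chance agreement) divided by (one minus expected chance agreement). The function names and the toy labels are assumptions made up for illustration; they are not the paper's code or data, and the size of the drop here does not reproduce the paper's thirty-eight-point figure.

```python
from collections import Counter


def raw_agreement(judge_labels, human_labels):
    """Fraction of items where the judge and the human label match."""
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


def cohens_kappa(judge_labels, human_labels):
    """Chance-corrected agreement between two equally long label lists."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(judge_labels)
    p_observed = raw_agreement(judge_labels, human_labels)
    # Expected chance agreement, from each rater's own label frequencies.
    judge_freq = Counter(judge_labels)
    human_freq = Counter(human_labels)
    p_chance = sum(judge_freq[c] * human_freq[c] for c in judge_freq) / (n * n)
    if p_chance == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        # Returning 0.0 here is a choice made for this sketch.
        return 0.0
    return (p_observed - p_chance) / (1.0 - p_chance)


# Toy data (hypothetical): three balanced categories, 24 of 30 labels agree.
human = ["A"] * 10 + ["B"] * 10 + ["tie"] * 10
judge = (["A"] * 8 + ["B"] * 2) + (["B"] * 8 + ["tie"] * 2) + (["tie"] * 8 + ["A"] * 2)

print(round(raw_agreement(judge, human), 3))  # 0.8
print(round(cohens_kappa(judge, human), 3))   # 0.7
```

In this balanced toy case the chance floor is one third, so 80 percent raw agreement becomes a kappa of 0.7. The gap grows when the label distribution is skewed, because chance agreement is then higher.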

The second finding is rank instability. Depending on which benchmark you use to measure judges, the ranking of which judge is "best" changes substantially. More than half the judges in the study shifted by four or more rank positions when the benchmark changed. The worst case in the study was a single model that fell from fifth place to twentieth — a fifteen-position swing from just switching the evaluation task. This isn't because the judges got worse; it's because different benchmarks use different mixes of tasks, and small differences in performance get amplified or compressed differently on each.
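The rank-shift measurement can be sketched in a few lines of Python. Everything below is an assumption for illustration: the judge names, the scores, and the threshold of three positions are invented, and the paper may compute ranks differently (for example, in how it breaks ties).

```python
def rank_positions(scores):
    """Return {judge: rank} for one benchmark, where rank 1 is the highest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


def rank_shifts(scores_a, scores_b):
    """Return {judge: rank on B minus rank on A}; positive means the judge fell."""
    if set(scores_a) != set(scores_b):
        raise ValueError("both benchmarks must score the same judges")
    ranks_a = rank_positions(scores_a)
    ranks_b = rank_positions(scores_b)
    return {name: ranks_b[name] - ranks_a[name] for name in ranks_a}


# Toy scores for five hypothetical judges (illustrative, not from the paper).
benchmark_x = {"judge_a": 0.71, "judge_b": 0.69, "judge_c": 0.68,
               "judge_d": 0.66, "judge_e": 0.60}
benchmark_y = {"judge_a": 0.58, "judge_b": 0.64, "judge_c": 0.61,
               "judge_d": 0.65, "judge_e": 0.57}

shifts = rank_shifts(benchmark_x, benchmark_y)
# Judges whose rank moved by at least three positions (threshold is arbitrary).
moved = sorted(name for name, shift in shifts.items() if abs(shift) >= 3)

print(shifts)  # {'judge_a': 3, 'judge_b': 0, 'judge_c': 0, 'judge_d': -3, 'judge_e': 0}
print(moved)   # ['judge_a', 'judge_d']
```

Note how closely the toy scores are bunched: a few hundredths of a point separate most judges, which is enough to move one of them from first place to fourth when the benchmark changes.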

The third finding is the most conceptually important: high consistency and severe bias can live in the same judge simultaneously. The researchers found judges that gave the same answer every time they were asked (high reliability) while systematically preferring whichever answer appeared first in the comparison (high position bias). In the extreme case, a judge that always picks "Answer A" regardless of quality would score perfect test-retest reliability and maximum position bias simultaneously. Reliability measures whether the output is stable. It says nothing about whether the output is correct.
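The extreme case is easy to reproduce in a short Python sketch. The judge interface here is a hypothetical one chosen for illustration: a function that takes two answers and returns "first" or "second". The toy answers and the QUALITY lookup are likewise invented, and the two metrics are simple stand-ins rather than the paper's exact definitions.

```python
def always_first(first, second):
    """Degenerate judge: prefers whatever is shown first, ignoring content."""
    return "first"


# Hypothetical ground-truth quality, used only by the content-based toy judge.
QUALITY = {"good answer": 2, "weak answer": 1,
           "detailed answer": 2, "vague answer": 1}


def content_judge(first, second):
    """Toy judge that decides on content alone (no ties in the toy data)."""
    return "first" if QUALITY[first] > QUALITY[second] else "second"


def retest_consistency(judge, pairs, runs=3):
    """Fraction of pairs where the judge returns the same verdict on every run."""
    stable = sum(len({judge(a, b) for _ in range(runs)}) == 1 for a, b in pairs)
    return stable / len(pairs)


def position_flip_rate(judge, pairs):
    """Fraction of pairs whose winning answer changes when the order is swapped."""
    flips = 0
    for a, b in pairs:
        winner_original = a if judge(a, b) == "first" else b
        winner_swapped = b if judge(b, a) == "first" else a
        flips += winner_original != winner_swapped
    return flips / len(pairs)


pairs = [("good answer", "weak answer"),
         ("weak answer", "good answer"),
         ("detailed answer", "vague answer")]

for judge in (always_first, content_judge):
    print(judge.__name__,
          retest_consistency(judge, pairs),
          position_flip_rate(judge, pairs))
# always_first 1.0 1.0
# content_judge 1.0 0.0
```

Both judges are perfectly consistent, so a consistency check alone cannot tell them apart. Only the order-swap test exposes the judge that never reads the answers.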

One piece of genuinely good news: the old complaint that AI judges prefer longer answers has largely faded. All twenty-one judges in the study showed verbosity bias so small as to be practically negligible, an order of magnitude smaller than it was a few years ago. Correcting for answer length in your judge setup is probably no longer necessary with modern frontier models.

The paper proposes a five-item checklist for validating judges before trusting them: chance-correct the agreement metric, test whether swapping the order of answers changes the result, replicate the grading at least three times to catch instability, validate across at least two different benchmarks, and specifically check that judges with very high consistency are not also showing position bias. None of these steps is expensive or technically demanding. Most current published work does none of them.
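One way to keep the five items from being skipped is to encode them as an automated gate. The Python sketch below is a hypothetical illustration: the JudgeAudit fields, the function name, and the two numeric thresholds (0.95 for "very high consistency" and 0.10 for an acceptable flip rate) are assumptions, not values from the paper. Only the "at least three replications" and "at least two benchmarks" minimums come from the checklist itself.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class JudgeAudit:
    """Summary of what was measured for one judge (hypothetical structure)."""
    kappa: Optional[float]               # None if only raw agreement was reported
    position_flip_rate: Optional[float]  # None if order swaps were never tested
    replications: int                    # number of repeated grading runs
    benchmarks: int                      # number of benchmarks validated on
    retest_consistency: Optional[float]  # None if not measured


def unmet_checklist_items(audit: JudgeAudit,
                          high_consistency: float = 0.95,  # assumed threshold
                          max_flip_rate: float = 0.10      # assumed threshold
                          ) -> List[str]:
    """Return the checklist items this audit does not yet satisfy."""
    unmet = []
    if audit.kappa is None:
        unmet.append("chance-correct the agreement metric")
    if audit.position_flip_rate is None:
        unmet.append("test whether swapping answer order changes the result")
    if audit.replications < 3:
        unmet.append("replicate the grading at least three times")
    if audit.benchmarks < 2:
        unmet.append("validate across at least two benchmarks")
    # Item five: a very consistent judge must also be shown free of position bias.
    very_consistent = (audit.retest_consistency is not None
                       and audit.retest_consistency >= high_consistency)
    bias_unchecked_or_high = (audit.position_flip_rate is None
                              or audit.position_flip_rate > max_flip_rate)
    if very_consistent and bias_unchecked_or_high:
        unmet.append("check that high consistency is not masking position bias")
    return unmet


typical = JudgeAudit(kappa=None, position_flip_rate=None,
                     replications=1, benchmarks=1, retest_consistency=0.99)
careful = JudgeAudit(kappa=0.62, position_flip_rate=0.04,
                     replications=3, benchmarks=2, retest_consistency=0.97)

print(len(unmet_checklist_items(typical)))  # 5
print(unmet_checklist_items(careful))       # []
```

The numbers in the two example audits are made up. The point of the design is that missing measurements count as failures, so a judge cannot pass simply because nobody looked.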

For anyone building reward models, running automated evaluations, or relying on judge-based quality scores to guide training, the practical upshot is direct: your existing judge validation is probably overclaiming by a meaningful amount, and a positionally biased judge that just picks "A" could pass a test suite that checks only consistency. The stakes are high: if the reward signal that shapes a model's behavior is calibrated against a broken judge, the flaw gets baked into every model trained that way.

Primary source, verified: arXiv:2606.19544