News · 2026-06-21

A 61-author paper argues AI leaderboards quietly mislead everyone

Every week, someone announces that a new AI is now 'number one' on some leaderboard. We've all learned to read those rankings as a scoreboard: higher is better, top of the list wins. A sprawling new position paper — sixty-one authors, led from IBM — argues that this instinct is quietly, systematically wrong, and that the way the field ranks AI agents is closer to grading students on a practice test and then being shocked when they flunk the real exam. You can read it on arXiv.

First, the background a newcomer needs. An 'AI agent' is a model that doesn't just chat — it takes actions: browses files, calls tools, runs code, works through a multi-step job on its own. To compare agents, researchers build benchmarks: standardized batteries of tasks, scored, averaged into a single number, sorted into a leaderboard. That single number is what gets quoted in announcements and what buyers use to decide which system to trust with real work.

The paper's core finding is about what that number leaves out. The authors point out that no single benchmark captures more than a handful of the things that actually matter once an agent is deployed — how it handles different kinds of data, how it's wired together with other tools, how it retrieves information, how it reasons, how it copes when the infrastructure around it changes. To probe this, they ran an unusually large coordinated effort: fourteen parallel deep-dive studies of one industrial agent benchmark, then combined those with seven earlier benchmarks. Their conclusion is blunt: rankings built from average scores do not transfer to new, out-of-distribution situations. An agent that tops the chart on the public test can tumble when the test is swapped for one it hasn't effectively memorized — and the paper cites real 'public test versus hidden test' competition results showing exactly that kind of rank scrambling.
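To make the rank-scrambling point concrete, here is a minimal sketch of an average-score leaderboard computed twice, once on a public task set and once on a hidden one. Everything in it is an assumption made for illustration: the agent names, the scores, and the `leaderboard` helper are invented and are not taken from the paper or from any real competition.

```python
from statistics import mean

# Hypothetical per-task scores (0 to 1), invented for illustration only.
public_scores = {
    "agent_a": [0.90, 0.95, 0.85, 0.90],
    "agent_b": [0.80, 0.85, 0.80, 0.75],
    "agent_c": [0.70, 0.75, 0.70, 0.65],
}
hidden_scores = {
    "agent_a": [0.50, 0.45, 0.60, 0.55],  # strong in public, weak on unseen tasks
    "agent_b": [0.75, 0.80, 0.70, 0.75],
    "agent_c": [0.65, 0.60, 0.70, 0.65],
}


def leaderboard(scores):
    """Return agent names sorted by mean score, best first."""
    return sorted(scores, key=lambda name: mean(scores[name]), reverse=True)


public_rank = leaderboard(public_scores)
hidden_rank = leaderboard(hidden_scores)

print("public:", public_rank)
print("hidden:", hidden_rank)
```

With these made-up numbers, the agent that leads on the public set drops to last on the hidden set, while the other two keep their relative order. A single averaged number on the public set gives no hint that this will happen.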

Here's the idea with an analogy. Imagine ranking restaurants purely by how they perform on one fixed tasting menu, announced in advance. Chefs would, naturally, perfect that exact menu. The leaderboard would then tell you who cooks that one meal best, and almost nothing about who'll cook you a great dinner from ingredients they didn't know were coming. A high score can mean genuine skill, or it can mean the test leaked into the training data and the model is essentially reciting answers. From the outside, those two look identical. (This is the same trap behind a recent finding that models acing Python coding tests stumble in other languages; see 'AI coding skill in Python doesn't carry over'. It also rhymes with why AI judges can be confident and wrong.)

What the authors actually propose is a different way to rank. Instead of sorting systems by their average score on the test in front of you, sort them by predictive validity — how well a ranking measured on one set of tasks predicts the ranking on a different, unseen set. In plain terms: don't reward the system that scores highest today; reward the system whose 'good today' reliably means 'good tomorrow.' They lay out a twelve-layer measurement scheme and, refreshingly, three specific, falsifiable tests their own claim must pass, plus a pre-registered pilot to run them.
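One way to picture "how well one ranking predicts another" is a rank correlation. The sketch below uses Kendall's tau, which is a stand-in chosen for illustration: this article does not say which statistic the paper uses, and the `kendall_tau` helper, the agent names, and the rankings are all hypothetical.

```python
from itertools import combinations


def kendall_tau(rank_x, rank_y):
    """Kendall's tau-a between two rankings of the same items, with no ties.

    Each ranking is a list of names, best first. The result is 1.0 when the
    two orders agree on every pair and -1.0 when they disagree on every pair.
    """
    pos_x = {name: i for i, name in enumerate(rank_x)}
    pos_y = {name: i for i, name in enumerate(rank_y)}
    concordant = 0
    discordant = 0
    for a, b in combinations(rank_x, 2):
        # A pair is concordant if both rankings put a and b in the same order.
        if (pos_x[a] - pos_x[b]) * (pos_y[a] - pos_y[b]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical rankings, invented for illustration only.
seen_tasks = ["agent_a", "agent_b", "agent_c", "agent_d"]
unseen_stable = ["agent_a", "agent_b", "agent_d", "agent_c"]     # one pair swapped
unseen_scrambled = ["agent_d", "agent_c", "agent_b", "agent_a"]  # fully reversed

tau_stable = kendall_tau(seen_tasks, unseen_stable)
tau_scrambled = kendall_tau(seen_tasks, unseen_scrambled)

print(round(tau_stable, 3), tau_scrambled)
```

Under this reading, a benchmark whose ranking carries over to unseen tasks scores close to 1, and one whose ranking scrambles scores near or below 0. The paper's actual measurement scheme may define predictive validity differently.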

Why it matters: leaderboards aren't just bragging rights. Companies make purchasing decisions, and researchers steer entire labs, based on these numbers. If the numbers reward memorizing the test rather than general competence, the whole field is being pulled, gently and constantly, toward looking good on benchmarks instead of being good at work. Naming that dynamic — and proposing a concrete metric that resists it — is the kind of plumbing that doesn't trend but quietly improves everything downstream. (For the bigger picture on how this all works, see our new explainer, how AI gets benchmarked.)

The honest caveat is one the authors volunteer themselves: they write that the existing evidence 'partly supports' their position but is 'too thin to confirm' it. This is a manifesto with a research plan attached, not a closed case. The skeptical reflex it's trying to instill is healthy; the specific cure, measuring predictive validity at scale, still has to prove it actually treats the disease. But as a statement of the problem, it lands, and it arrives at a moment when 'we topped the leaderboard' has never been a louder marketing line.

Primary source, verified: read the paper → (arXiv 2606.19704)