News · 2026-06-24

Can an AI agent match real published science? A new test says: rarely

AI labs love to claim their systems can do science. The claim is usually backed by cherry-picked anecdotes or benchmarks that quietly let the AI look up the answer. A new benchmark called NatureBench (Hugging Face) tries to settle the question more honestly, and its answer is a useful cold shower: today's AI coding agents can apply known scientific techniques fairly well, but they rarely produce anything that beats the real published state of the art -- and almost never by genuine invention.

Here's what the researchers built. They took ninety tasks drawn directly from peer-reviewed papers in the Nature family of journals -- some of the most prestigious science there is -- spanning many disciplines. For each task, the bar to clear is the result the human scientists actually published. Then they handed those tasks to ten of the leading AI agent setups and watched.
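To make that bar concrete, here is a minimal sketch of how "beat the published result" could be scored across tasks. It is an illustration, not the benchmark's actual scoring code: the TaskResult fields, the strict-improvement rule, and the demo numbers are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """One task's outcome. All field names here are hypothetical."""
    task_id: str
    published_score: float   # result reported in the original paper
    agent_score: float       # agent's best result on the same metric
    higher_is_better: bool   # True for accuracy-like metrics, False for error-like ones


def beats_published(result: TaskResult) -> bool:
    """Return True only if the agent strictly improves on the published result."""
    if result.higher_is_better:
        return result.agent_score > result.published_score
    return result.agent_score < result.published_score


def beat_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks on which the agent beats the published result."""
    if not results:
        raise ValueError("no results to score")
    return sum(beats_published(r) for r in results) / len(results)


# Made-up numbers for illustration only; these are not NatureBench data.
demo = [
    TaskResult("task-a", published_score=0.91, agent_score=0.88, higher_is_better=True),
    TaskResult("task-b", published_score=0.12, agent_score=0.10, higher_is_better=False),
    TaskResult("task-c", published_score=0.75, agent_score=0.75, higher_is_better=True),
    TaskResult("task-d", published_score=3.40, agent_score=4.10, higher_is_better=False),
]
print(beat_rate(demo))  # 0.25: only task-b improves on its published score
```

The one design choice worth noticing is the metric direction flag: a lower error and a higher accuracy both count as wins, and a tie with the published number does not.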

Two design choices make this benchmark trustworthy where others aren't. First, they turned off web search. This sounds small but is crucial: if an agent can browse, then "reproduce this published result" becomes "find the paper and copy its answer," which tests memory, not science. By cutting off the lookup, they force the agent to actually do the work. Second, they built a standardized, containerized harness so every task runs in a clean, consistent environment. Past attempts to test agents on research drowned in a swamp the authors call environment fragmentation -- every paper uses different software, different data formats, different setups, so just getting an agent to the starting line was its own ordeal. NatureBench fixes that, which is part of why it's a genuine contribution to how AI is benchmarked.
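For a sense of what "search off, clean container" can look like in practice, here is a minimal sketch that builds a container launch command with networking disabled. This is an assumption about one way to do it, not a description of the NatureBench harness: the article does not say which container runtime is used, and the image name, task directory, and entry script below are hypothetical. The flags themselves (--rm, --network none, --volume, --workdir) are standard docker run options.

```python
import shlex


def sandboxed_command(image: str, task_dir: str, entrypoint: list[str]) -> list[str]:
    """Build a `docker run` argv that starts one task with no network access.

    Assumption: Docker is the container runtime. The source only says the
    harness is "containerized".
    """
    return [
        "docker", "run",
        "--rm",                              # discard the container after the run
        "--network", "none",                 # no network, so no web search or paper lookup
        "--volume", f"{task_dir}:/task",     # mount the task files into the container
        "--workdir", "/task",                # start the agent inside the task directory
        image,
        *entrypoint,
    ]


# Hypothetical image name, path, and script, used only to show the shape of the command.
cmd = sandboxed_command(
    image="example/task-env:latest",
    task_dir="/data/tasks/task-a",
    entrypoint=["python", "run_agent.py"],
)
print(shlex.join(cmd))
```

The sketch only assembles and prints the command; actually launching it (for example with subprocess.run) would require Docker to be installed and the image to exist.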

The results are sobering in a clarifying way. Even the strongest agent configuration managed to beat the published state of the art on only a small minority of the tasks. For the overwhelming majority, the best AI in the world could not match what human scientists had already done. But the most interesting finding is in how the agents succeeded and failed. When they did well, it was through what the authors call methodological translation: taking a hard, unfamiliar scientific problem and reframing it as a familiar, well-understood prediction task the agent already knew how to attack. That's a real and useful skill -- a lot of applied science is recognizing that your weird problem is secretly a standard problem in disguise -- but it is not invention. The agents were good at applying the known, weak at discovering the new.

And when they failed, they mostly failed for mundane reasons: choosing the wrong method for the problem, or simply running out of computing resources, rather than fundamentally misunderstanding the task. That's an important nuance. It means the agents generally grasped what was being asked; they just couldn't figure out the right way to do it or didn't have the horsepower to finish. The wall they hit isn't comprehension -- it's judgment and resourcefulness, the things that separate a competent technician from a creative scientist.

Why it matters: this is a reality check delivered at exactly the right moment, when "AI is doing science" claims are everywhere. It fits a pattern of recent results showing that agents look more capable on flashy benchmarks than they are at the messy real thing -- the same lesson as "being good at Python isn't the same as being good at coding," and the broader warning that "the leaderboard is lying." NatureBench extends that skepticism to the highest-stakes domain: actual published research. For anyone deploying agents to accelerate research, it's a map of where they help today (translating and applying known methods, fast) and where they still don't (genuine scientific creativity).

The honest caveats cut both ways. On one hand, beating Nature-level published results is an extraordinarily high bar -- these are humanity's best efforts in each field, so an agent clearing it even occasionally, with no web access, is arguably impressive rather than disappointing, depending on your priors. On the other, ninety tasks is a snapshot, and benchmarks always risk measuring the tasks that were easy to package rather than the science that matters most. And like every benchmark, it captures this moment; agents are improving quickly, and the share they can match will almost certainly climb. The lasting value of NatureBench may be less the score than the method -- a clean, search-disabled, standardized way to ask the question again every few months and watch the line move.

Primary source, verified: arXiv 2606.24530