News · 2026-06-24

Can an AI Agent Reproduce Real Science? A New Test Says: Rarely

There is a recurring claim in AI right now that the best models are on the verge of doing real science -- not just summarizing papers, but generating genuine discoveries. A new benchmark called NatureBench puts that claim to a hard, concrete test, and the result is a useful splash of cold water.

Here is the setup, because the cleverness is in the design. The researchers took ninety computational tasks drawn from peer-reviewed papers in the Nature family of journals -- some of the most prestigious, heavily scrutinized science published anywhere. Each task captures a real result from a real paper: given this data and this scientific question, reproduce the finding that the human researchers reached and that survived expert review. Then they turned loose today's strongest AI coding agents -- the kind that can write and run their own programs -- and asked a simple question: can you match what the published science achieved? To make the test fair and repeatable, they also built an automated system that wraps each task in a standardized environment, so every agent is graded the same way. This matters because sloppy benchmarks are a real problem, something we explored in the story about how the leaderboard can be lying. It also connects to the broader question of how AI gets benchmarked at all.

The result, in plain terms: the best models matched or beat the published state of the art on fewer than one in five of the tasks, which on a ninety-task benchmark means fewer than eighteen. On the large majority, they fell short. And the way they succeeded when they did succeed is the most revealing part. The authors found that agents tend to win not by inventing new science, but by quietly translating a scientific problem into a familiar shape they already know how to handle -- turning a novel question into a standard prediction exercise they have seen a thousand times. When real scientific invention was required, they mostly failed, and the common failure modes were mundane: picking the wrong method for the problem, or simply not having enough computing power to finish the job properly.
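For readers who want to see what a headline number like this measures, here is a minimal sketch of how a "matched or beat the published result" rate could be tallied. It is an illustration, not the paper's grading code: the `TaskResult` fields, the `higher_is_better` flag, and the toy scores are all assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class TaskResult:
    """One benchmark task (hypothetical structure, not from the paper)."""
    task_id: str
    agent_score: float        # what the agent achieved
    published_score: float    # what the original paper reported
    higher_is_better: bool = True  # False for error-style metrics


def matches_published(result: TaskResult) -> bool:
    """True if the agent matched or beat the published score."""
    if result.higher_is_better:
        return result.agent_score >= result.published_score
    return result.agent_score <= result.published_score


def match_rate(results: Sequence[TaskResult]) -> float:
    """Fraction of tasks where the agent matched or beat the paper."""
    if not results:
        return 0.0
    return sum(matches_published(r) for r in results) / len(results)


# Toy data, invented purely for illustration.
toy_results = [
    TaskResult("toy-accuracy", 0.91, 0.88),
    TaskResult("toy-error", 0.30, 0.25, higher_is_better=False),
    TaskResult("toy-f1", 0.70, 0.74),
    TaskResult("toy-auc", 0.60, 0.81),
    TaskResult("toy-rmse", 1.90, 1.20, higher_is_better=False),
]

print(f"Match rate: {match_rate(toy_results):.0%}")
```

The point of the sketch is that the score is a simple count: each task is a pass or a fail against the published figure, and a near miss counts the same as a wide one.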

The analogy is the difference between a brilliant student and a working scientist. A strong student can take any problem that resembles their homework and crush it, because they recognize the template and apply it flawlessly. A scientist's actual job begins where the templates run out -- when the problem does not look like anything in the textbook and you have to invent the approach. NatureBench suggests today's agents are superb students and not yet scientists. They are excellent at converting the unfamiliar into the familiar, and stuck when the unfamiliar refuses to be converted.

Why this matters: there is enormous hype, and serious money, riding on the idea that AI is about to accelerate scientific discovery. This benchmark does not say that is impossible, but it draws a sharp, honest line around where the technology actually is. Reproducing published results is, in an important sense, the easy version of the dream -- the answer already exists and is known to be correct. If agents can only match top-tier published work on a small fraction of cases, the harder dream of generating genuinely new, correct discoveries is further off than the most excited headlines imply. It is a healthy corrective to a field that loves to extrapolate, and it complements other recent work pushing agents toward real lab science, like the systems that run their own experiments.

The caveat cuts both ways, as the fairest ones do. On the optimistic side, a benchmark is a snapshot, and these agents are improving quickly -- a score that looks modest today can climb fast, and 'fewer than one in five' a year from now could read very differently. On the skeptical side, even this number deserves scrutiny: matching a published computational result is not the same as independently validating that the result is true, and an agent that hits the target by translating problems into familiar templates may be gaming the format rather than doing science. The real value here is not the score but the diagnosis -- a clear, reproducible account of how these agents win and how they fail, which is worth more than any single percentage. It gives the field a concrete place to push next, instead of another round of vague claims about machines on the cusp of discovery.

Primary source, verified: arXiv 2606.24530