benchmarks

Everything on Ground Truth tagged “benchmarks” — 12 items.

The best AI agents still fail most real, long computer tasks (News)

A wave of new benchmarks agrees on an uncomfortable result: even top models finish only a small slice of realistic, multi-hour computer and coding jobs.

Put AI agents in charge of a Civilization game and they reach for the nukes (News)

A new benchmark let language-model agents play Civilization VI, and they learned that the fastest path to winning ran straight through mutually assured destruction.

An open model from China beat Claude on a security test, at a sixth of the cost (News)

Semgrep ran GLM 5.2 against Claude on a narrow vulnerability-finding task, and the free, open-weight model came out ahead for far less money.

Can an AI agent match real published science? A new test says: rarely (News)

NatureBench pits coding agents against the published state of the art from Nature-family papers. Even the best agents beat the bar on only a small minority of tasks, and mostly by reframing rather than inventing.

Can an AI Agent Reproduce Real Science? A New Test Says: Rarely (News)

A new benchmark points coding agents at the actual computational results behind ninety papers in top journals. The strongest models matched the published science on fewer than one in five.

How AI Gets Benchmarked, and Why the Leaderboard Can Lie (Lesson)

Every 'this AI is now #1' headline rests on a benchmark. Here's how those tests actually work, why a top score doesn't always mean what you think, and how to read a leaderboard like a skeptic.

A 61-author paper argues AI leaderboards quietly mislead everyone (News)

A large industry-led study makes a blunt case: the rankings everyone cites to pick the 'best' AI agent don't survive contact with the real world.

What does it mean for AI to grade AI? (Lesson)

We increasingly use one AI model to evaluate another's answers — because human grading doesn't scale. Here's how 'AI as a judge' works, why it's everywhere, and the traps that make it unreliable.

AI coding skill in Python doesn't carry over to other languages (News)

A widely trusted coding benchmark was Python-only. Expanding it to a dozen languages revealed that models acing Python often stumble badly elsewhere: Python skill isn't general coding skill.

AI 'world models' have short-term memory: they forget what's off-screen (News)

A sweeping study of dozens of AI video-prediction systems finds they don't truly remember the world; when something leaves the frame, they quietly reinvent it the next time you look.

Turn the camera away, and the AI's world freezes (News)

A new benchmark tests whether video AI systems can track what happens to parts of a scene the camera isn't currently showing. Across 23 models, the answer is mostly no — and making the models larger made the problem worse, not better.

Reliable, and still wrong (News)

Using one AI to grade another is now common — but the biggest audit yet shows these graders are consistent without being correct. A judge that always picks "answer A" scores perfectly on consistency.