evaluation

Everything on Ground Truth tagged “evaluation” — 13 items.

Knowing when to quit is a skill AI agents badly lack News

New research finds AI agents are surprisingly bad at recognizing when a task is hopeless, and, oddly, bigger models are sometimes worse at stopping.

When AI safety training withholds what could help you News

A pre-registered study finds that heavily safety-trained models give doctors medical information they refuse to give ordinary people, even when the facts are identical.

What does your AI actually remember about you? News

Two new studies stop trusting that agent 'memory' works and start measuring it directly, with results that carry a privacy sting.

Can an AI agent match real published science? A new test says: rarely News

NatureBench pits coding agents against the published state of the art from Nature-family papers. Even the best agents beat the bar on only a small minority of tasks, mostly by reframing, not inventing.

How AI Gets Benchmarked — and Why the Leaderboard Can Lie Lesson

Every 'this AI is now #1' headline rests on a benchmark. Here's how those tests actually work, why a top score doesn't always mean what you think, and how to read a leaderboard like a skeptic.

A 61-author paper argues AI leaderboards quietly mislead everyone News

A large industry-led study makes a blunt case: the rankings everyone cites to pick the 'best' AI agent don't survive contact with the real world.

Why does AI make things up? Lesson

Language models sometimes state false things with total confidence — a behavior called hallucination. It isn't a bug they'll simply patch out; it falls out of how they're built. Here's why it happens and how people fight it.

What does it mean for AI to grade AI? Lesson

We increasingly use one AI model to evaluate another's answers — because human grading doesn't scale. Here's how 'AI as a judge' works, why it's everywhere, and the traps that make it unreliable.

Independent testers probed the labs' secret models — and graded the danger News

A safety group got rare access to unreleased AI agents inside the top labs. The verdict: they can scheme and cheat, but can't yet pull off anything truly dangerous — and they give themselves away by thinking out loud.

AI coding skill in Python doesn't carry over to other languages News

A widely trusted coding benchmark was Python-only. Expanding it to a dozen languages revealed that models acing Python often stumble badly elsewhere: Python skill isn't general coding skill.

Your AI judge might be reliable — and still be wrong News

The largest audit of AI language model judges to date — 21 judges, over half a million grading decisions — finds that standard reliability metrics are inflated by roughly a third, that the same judge can score differently on different benchmarks, and that high consistency and severe bias can coexist in the same system.

Reliable, and still wrong News

Using one AI to grade another is now common — but the biggest audit yet shows these graders are consistent without being correct. A judge that always picks "answer A" scores perfectly on consistency.

Doubleword (async + batch inference) Tool

Run the same models you already use, but on async and batch tiers that trade latency for a large cost cut on workloads that don't need an instant reply: long-running agents, evaluations, and bulk jobs.