llm-judges

Everything on Ground Truth tagged “llm-judges” — 1 item.

Your AI judge might be reliable — and still be wrong (News)

The largest audit of AI language model judges to date — 21 judges, over half a million grading decisions — finds that standard reliability metrics are inflated by roughly a third, that the same judge can score differently on different benchmarks, and that high consistency and severe bias can coexist in the same system.