News · 2026-06-28

An open model from China beat Claude on a security test -- at a sixth of the cost

The security company Semgrep published a blog post with a deliberately cheeky title -- We Have Mythos At Home -- and the joke lands because the result underneath it is real. On one specific security task, a free model anyone can download beat Anthropic's Claude, and did it for roughly a sixth of the price.

Here is the background a non-expert needs. A huge share of real-world web bugs comes from one boring mistake: a site checks that you are logged in, but forgets to check that the thing you are asking for actually belongs to you. Change the order number in the address bar from 1001 to 1002 and you are suddenly looking at someone else's invoice. Security people call this broken access control, or an IDOR bug (short for insecure direct object reference). It is everywhere, it is costly, and it is exactly the kind of needle-in-a-haystack reading job people now hand to AI: point a model at a codebase and ask, where can a user reach data that isn't theirs?
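
For readers who want to see the mistake itself, here is a minimal sketch in Python. Everything in it is hypothetical and invented for illustration (the INVOICES data, the user names, the function names); it is not taken from Semgrep's test. It only shows the general shape of the bug and of the one-line ownership check that fixes it.

```python
# Hypothetical in-memory data: invoice number -> owner and amount.
INVOICES = {
    1001: {"owner": "alice", "total": 120},
    1002: {"owner": "bob", "total": 75},
}


class Forbidden(Exception):
    """Raised when a request should be refused."""


def get_invoice_vulnerable(current_user, invoice_id):
    # Checks only that someone is logged in, never who owns the invoice.
    if current_user is None:
        raise Forbidden("login required")
    return INVOICES[invoice_id]


def get_invoice_fixed(current_user, invoice_id):
    if current_user is None:
        raise Forbidden("login required")
    invoice = INVOICES[invoice_id]
    # The missing check: the record must belong to the person asking.
    if invoice["owner"] != current_user:
        raise Forbidden("not your invoice")
    return invoice
```

In this sketch, alice asking the vulnerable version for invoice 1002 gets bob's invoice back, while the fixed version refuses. The difference is a single comparison, which is part of why the bug is so easy to miss when reading a large codebase.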

Semgrep built a fair test around that question and ran several models through it. The standout was GLM 5.2, an open-weight model from the Chinese lab Zhipu AI. On the narrow task of catching these access-control bugs, GLM 5.2 scored ahead of Claude Code -- and because GLM is free to download and cheap to run, the cost per bug it found was about a sixth of Claude's. For a security team scanning millions of lines, that gap is the difference between scanning everything and scanning a sample.

How does a model do this at all? GLM 5.2 is a mixture-of-experts design: it is enormous on paper -- hundreds of billions of parameters -- but for any given chunk of text it only switches on a small slice of itself, which keeps it fast and affordable. It also reads up to about a million tokens at once, enough to hold a fair-sized codebase in working memory while it reasons about who can reach what. And critically, it ships under a permissive MIT license, so a company can run it on its own machines and never send a line of proprietary code to anyone else.
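
The "switches on a small slice of itself" idea can be illustrated with a toy sketch of top-k expert routing, the general technique behind mixture-of-experts models. This is not GLM 5.2's actual architecture: the four toy experts, the router scores, and the choice of k = 2 are all assumptions made up for the example, and in a real model the router scores are computed from the input by a learned layer rather than passed in by hand.

```python
import math


def softmax(scores):
    # Turn raw scores into weights that sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(
        range(len(router_scores)),
        key=lambda i: router_scores[i],
        reverse=True,
    )
    chosen = ranked[:k]
    weights = softmax([router_scores[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_layer(x, experts, router_scores, k=2):
    # Only the chosen experts are called; the others do no work at all.
    return sum(w * experts[i](x) for i, w in route_top_k(router_scores, k))


# Hypothetical toy "experts"; real ones are large neural network blocks.
EXPERTS = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
```

Calling moe_layer(3, EXPERTS, [0.1, 2.0, -1.0, 2.0]) runs only two of the four experts and blends their outputs. Scale that up and most of the model's parameters sit idle for any given token, which is where the speed and cost savings come from.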

Now the honest caveat, which Semgrep itself is careful to make, and which matters more than the headline. This is one narrow win, not a coronation. On harder, longer programming tasks -- the kind that involve juggling a whole project over many steps -- GLM 5.2 still trails the top closed models by a wide margin. And the sharpest point in the writeup is that the model alone was not even the best result on Semgrep's own board: their full scanning pipeline, the model wrapped in custom tooling and checks, beat every bare model by a healthy margin. The lesson is that how you wire a model into a system matters at least as much as which model you pick. A bare benchmark score is the start of the story, not the end of it, and any single benchmark result, this one included, should be treated with care.

Still, the direction is what has people talking. A year ago the assumption was that frontier capability lived behind a handful of American API keys. The Semgrep result is a clean data point that on at least one economically important task, a free model you can run in your own building can now be the rational default. Developers on the local-AI forums have been saying the same thing in plainer language: they are quietly moving day-to-day work onto GLM and keeping the expensive models for the genuinely hard problems. Combine that with the fact that the most powerful American models are getting harder to access, and you can see why a cheap, open, capable alternative suddenly feels less like a curiosity and more like infrastructure.

Primary source, verified: Semgrep's blog post, We Have Mythos At Home.