Ground Truth.
AI, checked against the source.

News · 2026-06-22

An AI wrote a working operating-system kernel from scratch in 38 minutes

If you want to understand why governments suddenly care about how good AI has gotten at code, skip the policy memos and read the minute-by-minute log of a model building an operating-system kernel from nothing. A developer documented exactly that (Tolmo: When the model writes the kernel): starting from a completely empty project folder, one of Anthropic's new models -- working on its own across roughly two hundred back-and-forth turns -- produced a small but genuinely bootable kernel that started up inside an emulator and passed its own built-in tests. All told, the model itself spent about thirty-eight minutes thinking and writing.

To appreciate how strange that is, you need to know what a kernel is. It's the innermost core of an operating system -- the part that talks directly to the hardware, manages memory, and decides which program runs when. It is famously some of the hardest, most unforgiving code in all of software. A single wrong assumption about how the processor works and nothing boots at all; there's no friendly error message, just a dead screen. Operating-system kernels are normally the domain of small teams of specialists working for months. Watching a model take an empty folder to a booting kernel in well under an hour is a bit like watching someone hand a robot a pile of raw steel and an empty lot and come back to find a small, running engine.

Now the essential caveats, because the headline oversells it. What the model built is a minimal kernel shaped like the core of Windows -- it boots and runs its self-checks, but it is not a full operating system. There's no part where you'd actually log in and run programs; it's the engine block, not the finished car. It runs inside an emulator, a software pretend-computer, rather than on a real laptop. So 'an AI wrote Windows' is wrong. 'An AI wrote, unassisted, the hardest layer of a real operating system, well enough to boot and self-test, in the time it takes to watch a sitcom' is right, and that's startling enough.

There's a small, almost poetic detail buried in the write-up. The project ran longer than the original session, and the later stretch had to switch to a different, older model -- because the model that started the job had been export-suspended partway through, the very shutdown described in this week's bigger story. The kernel demo is, in other words, a live illustration of the exact capability that got the model pulled, interrupted by the pulling.

How does a language model do something like this at all? It's the same underlying machinery behind chatbots -- a system trained to predict the next chunk of text -- but wrapped in a loop that lets it act like a developer: write a file, try to compile it, read the error, fix it, try again, run the tests, repeat. That tight feedback cycle is what separates a model that can describe a kernel from one that can actually produce a working one. Each failed compile is information, and the model keeps folding that information back in until the thing boots. If you want the broader picture of how these self-directed coding systems work, see our explainer on AI agents.
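
For readers who think better in code, here is a minimal sketch of that write, check, fix loop in Python. It is an illustration only, not the setup from the write-up: toy_model is a hypothetical stand-in that replays three scripted drafts instead of calling a real model, and check_source stands in for "compile it and run the tests" by using Python's built-in compile() plus a single assertion. All names here are invented for the example.

```python
from typing import Callable, List, Optional, Tuple

def check_source(source: str) -> Optional[str]:
    """Stand-in for 'compile and run the tests'.

    Returns None on success, otherwise the error text to feed back.
    """
    try:
        code = compile(source, "<candidate>", "exec")  # the "compile" step
        namespace: dict = {}
        exec(code, namespace)                          # load the candidate
        assert namespace["add"](2, 3) == 5, "add(2, 3) should be 5"  # the "test"
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None

def toy_model(feedback: Optional[str], attempt: int) -> str:
    """Hypothetical stand-in for the model: scripted drafts, no real LLM call.

    A real agent would read `feedback` and write a new draft in response.
    """
    drafts = [
        "def add(a, b) return a + b",        # does not compile
        "def add(a, b):\n    return a - b",  # compiles, fails the test
        "def add(a, b):\n    return a + b",  # compiles and passes
    ]
    return drafts[min(attempt, len(drafts) - 1)]

def feedback_loop(
    model: Callable[[Optional[str], int], str],
    check: Callable[[str], Optional[str]],
    max_turns: int = 10,
) -> Tuple[Optional[str], List[Optional[str]]]:
    """Ask for a draft, check it, feed the error back, stop on success."""
    feedback: Optional[str] = None
    history: List[Optional[str]] = []
    for turn in range(max_turns):
        source = model(feedback, turn)
        feedback = check(source)
        history.append(feedback)
        if feedback is None:      # nothing left to fix
            return source, history
    return None, history          # out of turns without a passing draft

result, history = feedback_loop(toy_model, check_source)
for turn, error in enumerate(history, start=1):
    print(f"turn {turn}: {error or 'ok'}")
```

The point of the sketch is the shape, not the scale: every failed check becomes input to the next attempt, and the loop ends only when the check passes or the turn budget runs out. The kernel run described above is the same idea stretched across roughly two hundred turns, with a real compiler and emulator in place of the toy check.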

Why it matters is straightforward and double-edged. The same ability that lets a model stand up systems code from scratch is the ability that lets it understand, and potentially exploit, the systems code everyone else relies on. That dual-use quality is precisely what made this capability tier a target for the new oversight rules. It's also why this single anecdote has been passed around so widely: it's concrete in a way that benchmark charts never are. You don't need to trust a score; you can read the log.

The honest caveat: this is one impressive run, documented by one developer, and a curated success story is not the same as reliability. We don't see how many attempts failed, how brittle the result is, or how it would fare on hardware that doesn't behave as politely as an emulator. A model that can do this once under good conditions is genuinely remarkable; a model that can do it on demand, every time, would be a different and more consequential thing -- and that second claim isn't established here.


Primary source, verified: read the write-up →