News · 2026-06-20

AI coding skill in Python doesn't carry over to other languages

When you read that an AI model is great at coding, there's a good chance the claim rests on a Python test. Python is the default language of AI research, it's everywhere in training data, and most popular coding benchmarks are written in it. That's convenient, but a new study shows that this reliance on Python has quietly distorted our picture of how good these models really are. Stretch the test across a dozen programming languages, and the impressive Python scores turn out to be a poor guide to general coding ability.

The project, Multi-LCB, takes a respected, contamination-resistant coding benchmark that was Python-only and rebuilds the same problems in twelve different programming languages, keeping the underlying tasks equivalent so the comparison is fair. Then it runs a broad set of models across all of them. The point is simple: if a model truly understands programming, it should be able to solve the same logic puzzle whether you ask for it in Python, Java, Rust, or something more obscure. Real understanding shouldn't evaporate when the syntax changes.

Three findings stand out. First, Python overfitting: many models that look excellent in Python perform markedly worse in other languages — they've over-specialized in the language they saw most. Second, uneven contamination: the degree to which test problems appear to have leaked into a model's training varies by language, a fingerprint of how lopsided these models' training diets are toward popular languages. Third, large gaps across languages, with models especially weak in stricter, more structured languages and in less common ones that show up rarely in training data. The blunt conclusion: a model's Python performance is not a reliable stand-in for its coding ability in general.

An analogy: imagine judging someone's overall musical talent solely by how well they play one song they've practiced a thousand times. They'll sound like a virtuoso — until you hand them a new piece, or a different instrument, and discover the talent was narrower than it looked. Testing only in Python is that one over-practiced song. Multi-LCB hands the models a different instrument and listens to what actually comes out.

Why it matters: benchmarks shape everything. They decide which models look best, which research directions get funded, and which claims make headlines. If the headline coding test is single-language, the entire field is optimizing for a narrow slice of reality while telling itself the slice is the whole. Real software is written in a sprawling variety of languages, and a coding assistant that only truly shines in Python is far less useful than its leaderboard position suggests. Building tests that span many languages forces a more honest measure of general skill — and this is part of a broader reckoning this week about how AI gets evaluated, with several groups arguing that a single tidy score hides more than it reveals.

There is an honest caveat. The weaker results in less common languages might not reflect a deep inability to generalize so much as a simple shortage of training material: these models have just seen far less code in those languages. With a more balanced training diet, some of the gap might close, which would mean the problem is partly about what we feed models rather than a fundamental limit of how they learn. That's an important distinction: "can't generalize" and "wasn't taught enough" call for different fixes. Either way, the practical lesson is sturdy: the next time a model is crowned a coding champion on a Python-only test, treat the crown with suspicion. The same model handed a different language might tell a very different story.

Primary source (verified): arXiv 2606.20517