News · 2026-06-30

The best AI agents still fail most real, long computer tasks

A cluster of new benchmarks released this week shows that even the best frontier AI agents fail most of the time on realistic, long, messy tasks. On OSWorld 2.0, the top model completed only about one in five real computer workflows. On SWE-INTERACT, top models solved only about a quarter of interactive coding tasks — half their one-shot rate. Several independent teams reached the same conclusion at once.

Key facts

What: A wave of new benchmarks agrees on an uncomfortable result: even top models finish only a small slice of realistic, multi-hour computer and coding jobs.
When: 2026-06-30
Primary source: arXiv 2606.29537

The most vivid result comes from OSWorld 2.0, a set of 108 real computer workflows built to mirror how people actually use a machine. These are not quick errands: a task takes a human a median of about an hour and a half, and hundreds of tool actions, to complete. The best model tested, running with maximum thinking effort, fully completed only about one in five of them. A rival frontier model, while more efficient with its words, plateaued even lower. The agents handled the mechanics (clicking, typing, writing code) well. They failed at the higher-level work: losing track of constraints they had been given, missing information that only appeared partway through, guessing instead of asking the user when something was ambiguous, and neglecting to check whether their own actions had worked. They did worst precisely when a task depended on recovering hidden state, that is, figuring out something that was not spelled out.

The same pattern shows up in coding. SWE-INTERACT rebuilt software-engineering tests as interactive sessions: instead of handing the agent a complete specification upfront, a simulated user starts vague, reveals requirements bit by bit, inspects the agent's work, and gives feedback — the way real collaboration actually goes. Top models that solved about half of the traditional one-shot version solved only about a quarter of the interactive one. They suffered from what the authors call over-agentic coding — charging ahead, forgetting earlier requirements, and making avoidable mistakes when the goal was not fully clear from the start. A companion benchmark, SWE-Together, measured not just whether the agent solved the problem but how much corrective steering a simulated user had to provide along the way — turning the fuzzy sense that some agents need a lot of hand-holding into an actual number. A fourth benchmark, TUA-Bench, tested general-purpose command-line tasks and found the strongest agent managed only about two-thirds overall, with wide gaps across task types.
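To make the two coding measurements concrete, here is a minimal sketch of how a solve rate, an average steering count, and the drop from one-shot to interactive could be computed. It is an illustration only: the names SessionResult, steering_turns, solve_rate, mean_steering and relative_drop are hypothetical, the sample sessions are made up, and the article does not describe how SWE-INTERACT or SWE-Together actually compute their scores.

```python
from dataclasses import dataclass


@dataclass
class SessionResult:
    """One simulated interactive session (hypothetical record format)."""
    solved: bool          # did the agent's final work pass the checks?
    steering_turns: int   # corrective messages the simulated user had to send


def solve_rate(results):
    """Fraction of sessions the agent solved."""
    return sum(r.solved for r in results) / len(results)


def mean_steering(results):
    """Average number of corrective messages per session."""
    return sum(r.steering_turns for r in results) / len(results)


def relative_drop(one_shot_rate, interactive_rate):
    """Share of one-shot performance lost in the interactive setting."""
    return 1 - interactive_rate / one_shot_rate


# Made-up sessions for illustration; these are NOT benchmark data.
sessions = [
    SessionResult(solved=True, steering_turns=1),
    SessionResult(solved=False, steering_turns=4),
    SessionResult(solved=False, steering_turns=3),
    SessionResult(solved=True, steering_turns=0),
]

print(solve_rate(sessions))
print(mean_steering(sessions))
# Using the article's rounded figures: about half one-shot, about a quarter interactive.
print(relative_drop(0.5, 0.25))
```

With the article's rounded figures, relative_drop(0.5, 0.25) comes out to 0.5, which is the "half their one-shot rate" gap expressed as a single number. The steering average is one plausible way to turn "needs a lot of hand-holding" into a score, not necessarily the one the benchmark uses.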

What unites these benchmarks is a shift in what they measure. Older benchmarks tend to hand an agent a clean, complete problem and check the final answer. Real work is nothing like that: it is long, the requirements arrive in pieces, information surfaces mid-task, and success depends on noticing when you are confused and asking rather than bluffing. These new tests deliberately recreate that mess, and the mess is exactly where current agents fall apart. This fits a theme running through this week's research, including a separate paper on whether agents even know when to stop trying, covered in "knowing when to quit".

The honest caveat cuts both ways. On one side, a benchmark is a snapshot of today's models, and these systems improve fast: a twenty-percent completion rate now could climb quickly, and the value of these tests is precisely that they give the next generation a harder, more honest target to aim at. On the other side, the gap between a polished demo and reliable performance on actual, ambiguous, multi-hour work is large, and it is widest exactly where it matters most: keeping track of constraints, handling ambiguity, and checking your own work. The practical takeaway for anyone deploying agents today is to keep a human firmly in the loop on anything long or consequential, and to distrust any single flashy benchmark score. For why leaderboards can mislead in the first place, see "how AI gets benchmarked".

Key questions

How badly do agents do on realistic tasks?

On a benchmark of long, real-world computer workflows, the best model fully completed only about one in five tasks. On interactive coding tasks, top models solved only about a quarter, roughly half of what they solved when given a complete specification upfront. Agents look far stronger on tidy tests than on messy real work.

What are the agents actually failing at?

Not basic clicking or coding. They lose track of constraints, miss information that appears mid-task, guess instead of asking the user, and fail to check their own work.

Why release several benchmarks at once?

Multiple independent teams converging on the same result makes it far more credible that the gap between demo and reality is real, not a quirk of one test.

Topics: research · agents · benchmarks · computer-use · coding-agents
