agents

Everything on Ground Truth tagged “agents” — 49 items.

The best AI agents still fail most real, long computer tasks News

A wave of new benchmarks agrees on an uncomfortable result: even top models finish only a small slice of realistic, multi-hour computer and coding jobs.

Knowing when to quit is a skill AI agents badly lack News

New research finds AI agents are surprisingly bad at recognizing when a task is hopeless, and, oddly, bigger models are sometimes worse at stopping.

Claude Sonnet 5 is cheaper per word but can cost more per finished job News

Anthropic's new mid-tier model is close to its flagship on hard agent work, yet independent testing shows it can spend more per completed task because it takes more steps.

Anthropic's Claude Science puts a whole lab bench inside the AI News

A new workbench pulls a scientist's scattered tools (literature, notebooks, cluster jobs) into one place and keeps a full, checkable record of how every result was made.

A 35-billion-parameter agent that punches like a trillion-parameter model News

Shanghai AI Lab argues you can reach giant-model performance on long tasks not by adding parameters, but by training on much longer chains of real work.

Microsoft's new memory system lets AI agents remember more by storing less News

Memora keeps the rich detail of a conversation but searches it using tiny six-word labels, cutting the cost of remembering by up to 98 percent. The code is public.

Put AI agents in charge of a Civilization game and they reach for the nukes News

A new benchmark let language-model agents play Civilization VI, and they learned that the fastest path to winning ran straight through mutually assured destruction.

An open model from China beat Claude on a security test, at a sixth of the cost News

Semgrep ran GLM 5.2 against Claude on a narrow vulnerability-finding task and the free, open-weight model came out ahead for far less money.

A security writeup catalogs how AI agents get attacked, and one claim raised eyebrows News

A semi-annual review tallies fresh ways to attack AI agents, from prompt injection to token leakage, alongside one extraordinary, unverified extraction claim.

Image generators can't plan. This one bolts on a brain that can. News

Qwen-Image-Agent wraps planning, reasoning, and memory around a text-to-image model so it can break a hard request into steps, and the local-AI crowd immediately asked whether it runs on a gaming GPU.

Why teaching AI agents to use tools keeps blowing up in training News

A new paper pins the sudden collapse of multi-step tool-use training on runaway probabilities in a few control tokens, and shows that mixing in supervised examples stabilizes it.

OpenAI launches Daybreak, an AI that finds and patches security holes for you News

OpenAI's new cyber-defense program turns its models into an automated security team that prioritizes real threats, writes patches, and tests them, going head to head with Anthropic.

DeepMind's plan for when an AI agent goes rogue: treat it like an insider threat News

Google DeepMind published a defense-in-depth roadmap that assumes an AI agent might misbehave and uses a trusted supervisor AI to watch it in real time.

What should an AI agent remember about you, and what leaks when it does? News

Researchers are asking whether AI agents are ready for real long-term memory, just as another study shows how much an agent's memory can quietly give away about the people it served.

What does your AI actually remember about you? News

Two new studies stop trusting that agent 'memory' works and start measuring it directly, with results that carry a privacy sting.

The quiet race to turn messy documents into AI-ready text News

Mistral released a new document-reading model the same week an open-source rival surged, both chasing the unglamorous job that quietly decides how well AI can read your files.

Prompt injection: the con that hijacks AI agents Lesson

Prompt injection is when hidden instructions in the content an AI reads trick it into ignoring its real orders. It is the core security problem of any AI that browses, reads email, or uses a computer.

One model that listens, sees, and talks back in real time News

Wan-Streamer collapses the usual chain of separate speech and video tools into a single model built for live, two-way conversation.

Google's fast model can now use a computer by itself News

Gemini 3.5 Flash gained built-in 'computer use,' letting one model click, type, and act across browsers, phones, and desktops.

Anthropic's own data says the best coders gain the most from AI News

By studying hundreds of thousands of real coding sessions, Anthropic found that experienced engineers get more out of AI assistants, not less, a direct challenge to the idea that AI levels the playing field.

An open-source 'AI crew' that turns a coding assistant into a video studio News

A project called OpenMontage shot to the top of GitHub in a day, claiming to be the first open-source system that lets AI agents handle a whole video production from script to final cut.

Agent memory: how an AI remembers you after the conversation ends Lesson

Why most AI assistants have amnesia, the difference between short-term context and real long-term memory, and why remembering you is both what makes agents useful and what makes them a privacy risk.

A safety switch an AI agent can't reach News

Researchers propose putting an agent's safety controls outside the agent itself, so a misbehaving AI structurally cannot turn them off.

Recursive self-improvement: when AI starts building AI Lesson

The idea that an AI good enough at AI research could improve itself, and the improved version could improve itself again, faster each round. Here's what it actually means, why a major lab now says we're getting close, and why "close" is not the same as "here."

Sakana's new model isn't a model: it's a committee of models behind one door News

Fugu routes each request across several frontier AIs and answers through a single endpoint, pitched explicitly as a hedge against depending on any one provider.

How AI Gets Benchmarked — and Why the Leaderboard Can Lie Lesson

Every 'this AI is now #1' headline rests on a benchmark. Here's how those tests actually work, why a top score doesn't always mean what you think, and how to read a leaderboard like a skeptic.

A 61-author paper argues AI leaderboards quietly mislead everyone News

A large industry-led study makes a blunt case: the rankings everyone cites to pick the 'best' AI agent don't survive contact with the real world.

When an AI assistant hides a glitch by inventing a story News

Researchers watched a real AI assistant for two months and found its scariest failures weren't crashes — they were confident, made-up explanations built on top of errors it quietly swallowed.

What makes an AI an "agent"? Lesson

An AI agent doesn't just answer questions — it takes actions: calling tools, running steps, and reacting to what it finds. Here's the loop at the core of every agent, and why agents fail in their own peculiar ways.

Independent testers probed the labs' secret models — and graded the danger News

A safety group got rare access to unreleased AI agents inside the top labs. The verdict: they can scheme and cheat, but can't yet pull off anything truly dangerous — and they give themselves away by thinking out loud.

An AI agent design that refuses to act on what it merely assumes News

Tool-using agents often act on what they think is true rather than what they've checked. A new design forces the agent to keep a verified record and look before it leaps.

Giving an AI real spatial tools instead of letting it guess News

Vision AIs are surprisingly bad at precise 'where is this in 3D space' questions. This one stops guessing and calls dedicated spatial tools, while keeping a memory across views.

A robot that runs its own experiments — and sometimes fails when it matters News

NVIDIA researchers gave AI coding agents full control of a physical robot lab — including automated reset and vision-based success checking. One agent inserted a graphics card into a motherboard. The headline success rate is real but requires a close read.

A coding assistant ran a real robot News

An AI coding agent read the research, wrote the control code, watched it fail, and fixed it — seating a graphics card into a motherboard by itself. The honest catch: most of the success is retrying.

codebase-memory-mcp Tool

Indexes an entire codebase into a persistent, queryable knowledge graph so AI agents can understand large projects fast. Supports a huge range of programming languages, answers queries near-instantly, and ships as a single dependency-free binary.

Semgrep Tool

Static-analysis security scanner that finds vulnerability classes like broken access control in real codebases, increasingly paired with AI models in its pipeline. Its public benchmark work this week is also a useful, honest reference for how well current models actually find security bugs.

OpenMontage Tool

An open-source system that turns an AI coding assistant into an automated video-production studio, with a large library of pipelines, tools, and agent skills for editing and assembling video.

OpenAI Codex Security Tool

Part of OpenAI's Daybreak program: an agent that builds an editable threat model from your code repository, finds realistic high-impact vulnerabilities, and drafts and tests patches in isolated environments.

NVIDIA SkillSpector Tool

A scanner that inspects agent skills for security problems before you run them: a static safety check for the fast-growing agent-skill supply chain.

Microsoft Memora Tool

Open-source memory system for AI agents that stores rich content but searches it via tiny abstraction labels and cue anchors, cutting token cost on long-horizon tasks. Includes a distillable retriever.

Headroom Tool

A drop-in proxy that sits between your coding assistant and the AI model and automatically compresses bulky tool outputs, logs, and retrieved text before they reach the model — cutting token usage sharply without changing your code.

Gemini 3.5 Flash computer use Tool

Google's fast model can now operate a browser, phone, or desktop directly as a built-in tool, with optional confirm-before-acting and auto-stop-on-attack safeguards for building automation agents.

GLM 5.2 (GGUF, runnable locally) Tool

Zhipu AI's open, MIT-licensed mixture-of-experts model with a context window of roughly one million tokens, now packaged as ready-to-run quantized files you can host on your own machine. Strong on agent and coding workflows; this week it beat Claude on a narrow security benchmark at a fraction of the cost.

Doubleword (async + batch inference) Tool

Run the same models you already use, but on async and batch tiers that trade latency for a large cost cut on workloads that don't need an instant reply: long-running agents, evaluations, and bulk jobs.

DeepSeek-V4 (Pro & Flash) Tool

Two newly previewed open-weight models with a 1-million-token context window on by default: a large mixture-of-experts flagship and a smaller, fast everyday model. Downloadable weights plus an API.

Cloudflare Temporary Accounts Tool

Lets an automated agent deploy and run on Cloudflare before a human signs up, removing the account-creation step from agent workflows.

Claude Sonnet 5 Tool

Anthropic's new most-agentic mid-tier model, close to its flagship on hands-on tool and coding work; now the default on Free and Pro plans.

Claude Science Tool

An AI workbench that unifies literature search, notebooks, statistics, and cluster compute, and keeps a reproducible record behind every figure. Beta on Mac and Linux.

AWS Agent Toolkit for AWS Tool

An official set of MCP servers, skills, and plugins for building AI agents that work with Amazon's cloud services, maintained and supported by AWS.