ai-agents

Everything on Ground Truth tagged “ai-agents” — 17 items.

This model's job is to make better training data for other models News

DataClaw0 turns the grind of cleaning and labeling training data into a learned skill -- a small model that refines raw, messy multimodal streams into dense, purpose-built lessons.

Can an AI agent match real published science? A new test says: rarely News

NatureBench pits coding agents against the published state of the art from Nature-family papers. Even the best agents beat the bar on only a small minority of tasks -- mostly by reframing, not inventing.

Can an AI Agent Reproduce Real Science? A New Test Says: Rarely News

A new benchmark points coding agents at the actual computational results behind ninety papers in top journals. The strongest models matched the published science on fewer than one in five.

Anthropic gives AI agents their own work accounts, not yours News

Anthropic's new 'agent identity' model lets Claude agents hold their own scoped accounts for tools like GitHub and Slack, tied to the team channels they work in, instead of borrowing a human employee's login.

Anthropic Gives Its AI Agents Their Own Logins, Not Yours News

As AI agents start working in teams alongside people, the old 'the bot acts as you' model breaks down. Anthropic's answer: give each agent its own scoped account in every system it touches.

An open project publishes the recipe for training capable AI agents News

OpenThoughts-Agent releases its full data-curation pipeline, dataset, and experiments -- showing that what an agent learns from matters more than raw size, and letting anyone reproduce it.

Alibaba's new models let AI agents practice in a world they imagine News

Qwen-AgentWorld trains a model to simulate the environment an agent acts in, then uses that simulation as a cheap, controllable place to learn -- and reports gains beyond those from training in the real environment.

AI Agents Are Learning to Build the Worlds They Train In News

Three new open research projects point the same way: instead of only learning what to do, agents are learning to simulate the environment itself, so they can practice in their own imagination.

A Coding AI Ran Through Uber's Yearly Budget in Four Months News

Uber gave Claude Code to about 5,000 engineers, who loved it. By April the company had burned through its entire 2026 AI budget, exposing how badly old software pricing fits new agent tools.

The AI That Now Writes Most of Its Maker's Code News

Anthropic says more than 80 percent of the code it ships is now written by its own model, Claude, but the more interesting numbers are about judgment.

A Free Model That Splits Your Work Across 300 Helpers News

Moonshot AI's Kimi K2.6 is a frontier-grade model anyone can download, and its headline trick is fanning a single job out to hundreds of helpers working in parallel.

Qwen-AgentWorld Tool

Alibaba's open language world model that simulates agent environments -- browser, terminal, phone, coding workspace and more -- so other agents can be trained inside the simulation. Released with open weights and code in two sizes.

MiniMax-M3 Tool

A natively multimodal open model trained on text, image, and video from the first step, with a million-token context and a sparse-attention design built for speed; downloadable for self-hosting and also offered through MiniMax's own API and agent platform.

Kimi (Kimi K2.6) Tool

Moonshot AI's web assistant and agent, running the open-weight Kimi K2.6 model; free to use in the browser for chat and long-horizon agent tasks, with the weights also downloadable for self-hosting.

DeerFlow Tool

ByteDance's open-source agent harness that breaks a long task into specialist sub-agents running in parallel, executes code safely in sandboxes, keeps memory across sessions, and produces reports, slides, and pages; built on LangChain and works with multiple model providers.

Claude Tag (agent identity access model) Tool

Anthropic's product for putting Claude to work in shared team channels, now with an access model that gives each agent its own scoped accounts in the systems it touches -- GitHub, Slack, a data warehouse -- instead of borrowing an individual user's permissions, so every action is bounded and audited.

Claude Code Tool

Anthropic's command-line coding agent that reads a whole codebase, edits files, runs tests and fixes failures on its own; it is the tool behind Anthropic's disclosure that Claude now authors most of its production code.