Learn · Intermediate

What Are Vision-Language-Action Models?

A vision-language-action (VLA) model is a single neural network that takes in camera images and a plain-language instruction and outputs the motor commands needed to carry it out. Instead of chaining together a separate vision system, a planner, and a controller, one model looks at the scene, understands "put the banana in the bowl," and emits the sequence of movements a robot arm executes. It is the same idea that made chatbots general: one big pretrained model, now pointed at the problem of physical action.

Key facts

A VLA maps camera images and a text instruction directly to robot actions, unifying perception, language, and control in one network.
The approach was crystallized by Google DeepMind's RT-2, described in Brohan et al. (2023), which built a robot policy on top of a large vision-language model.
OpenVLA, an open 7-billion-parameter model from Stanford and collaborators (Kim et al., 2024), made the recipe widely reproducible.
VLAs inherit common-sense knowledge from internet-scale pretraining, which is what lets them handle objects and phrasings they were never explicitly trained on.

To see why this is a big deal, consider how robots were programmed before. Traditionally, getting an arm to pick up a cup meant a brittle pipeline: one module to detect the cup, another to estimate its pose, a motion planner to compute a trajectory, and a controller to execute it. Each stage was hand-engineered, and each was a place for the whole thing to break when the lighting changed or someone moved the cup. Every new object or task meant more engineering. The pipeline did not generalize.

VLAs take the transformer recipe that powers language models and repurpose it. You start with a vision-language model, a network already trained on a huge slice of the internet's images and text, so it knows what a banana is, that bowls hold things, and that "the one on the left" refers to spatial position. Then you fine-tune it on robot data: recordings of a robot performing tasks, where the "answer" is not a sentence but the next action. Crucially, actions are encoded as tokens, the same kind of discrete units a language model predicts for words, so the model "speaks" motor commands in the exact same way it speaks text. Google DeepMind's RT-2 showed this transfer works: a robot could act on concepts it had only ever seen in web data, not in its robot training, because that knowledge was already baked into the vision-language backbone.
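To make "actions are encoded as tokens" concrete, here is a minimal sketch of one common scheme: clip each continuous action dimension to a range, split that range into equal-width bins, and use the bin index as the token. The function names, the 256-bin count, and the [-1, 1] range are illustrative assumptions, not the exact tokenizer of any particular model.

```python
from typing import List, Sequence

N_BINS = 256  # assumed number of bins per action dimension (illustrative)


def encode_action(action: Sequence[float],
                  low: Sequence[float],
                  high: Sequence[float],
                  n_bins: int = N_BINS) -> List[int]:
    """Map each continuous action value to an integer token in [0, n_bins - 1]."""
    tokens = []
    for a, lo, hi in zip(action, low, high):
        a = min(max(a, lo), hi)             # clip to the allowed range
        scaled = (a - lo) / (hi - lo)       # normalize to [0, 1]
        # The top edge of the range would land in bin n_bins, so cap it.
        tokens.append(min(int(scaled * n_bins), n_bins - 1))
    return tokens


def decode_action(tokens: Sequence[int],
                  low: Sequence[float],
                  high: Sequence[float],
                  n_bins: int = N_BINS) -> List[float]:
    """Map each token back to the center of its bin."""
    return [lo + (t + 0.5) / n_bins * (hi - lo)
            for t, lo, hi in zip(tokens, low, high)]


# Hypothetical 3-dimensional action, each dimension normalized to [-1, 1].
low, high = [-1.0] * 3, [1.0] * 3
tokens = encode_action([-1.0, 0.0, 1.0], low, high)
recovered = decode_action(tokens, low, high)
```

Decoding returns bin centers rather than the original values, so the round trip loses at most half a bin width per dimension. That quantization error is the price of letting the model predict actions with the same discrete output layer it uses for words.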

The analogy that helps: a VLA is like hiring someone who already understands the world and speaks your language, then teaching them to use their hands, rather than building a machine that only knows the exact motions you programmed. Because the base model has broad world knowledge, a well-trained VLA can often follow an instruction phrased in a new way, or manipulate an object it never saw in robot training, generalizing the way large language models generalize to new prompts. OpenVLA then made this practical for everyone by releasing the weights and training code, turning a Google-scale demonstration into a foundation the whole robotics field could build on.

Why it matters: VLAs are the leading bet for giving robots the kind of broad, flexible competence that language models gave software. If a single model can watch, understand, and act, then a general-purpose home or warehouse robot stops being a fantasy of endless task-specific engineering and becomes a matter of better models and more data, the same trajectory that carried chatbots from novelty to ubiquity. This is why VLAs and embodied AI agents are among the hottest research fronts right now.

The honest caveat, and it is a live one: pouring a model's capacity into controlling a robot can erode the very knowledge that made it useful. Recent work asks whether these models still "know the basics" after being tuned for action, and finds that VLAs can lose commonsense and world understanding as they specialize on control, a trade-off you can read about in today's coverage of VLA models forgetting the basics. VLAs are also far from reliable in the open world: they still fumble unfamiliar objects, struggle with long multi-step tasks, and need large amounts of robot data to train. The promise is real, but so is the gap between a clean lab demo and a robot you would trust in your kitchen.

Key papers
Brohan et al., RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control (2023)
Kim et al., OpenVLA: An Open-Source Vision-Language-Action Model (2024)

Key questions

What is a vision-language-action model?

It is a single neural network that reads camera images plus a written instruction and directly outputs the robot motor commands to carry out that instruction, combining perception, language understanding, and control in one model.

How is a VLA different from a normal chatbot or vision model?

A chatbot outputs text and a vision model outputs labels, but a VLA outputs actions, the actual joint movements or gripper commands a robot executes in the physical world.
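The difference shows up in the interface. As a rough sketch, a VLA can be treated as a function from (camera frame, instruction) to an action vector, called repeatedly in a control loop. Every name below (`Policy`, `run_episode`, `stub_policy`, the frame and action types) is hypothetical and stands in for a real model and robot driver.

```python
from typing import Callable, List

Image = List[List[int]]   # placeholder for a camera frame (rows of pixel values)
Action = List[float]      # placeholder for joint or gripper commands

# A VLA, seen from the outside: (image, instruction) -> action.
Policy = Callable[[Image, str], Action]


def run_episode(policy: Policy,
                get_frame: Callable[[], Image],
                send_action: Callable[[Action], None],
                instruction: str,
                max_steps: int) -> List[Action]:
    """Closed-loop control: observe, predict one action, execute, repeat."""
    history = []
    for _ in range(max_steps):
        frame = get_frame()                   # fresh observation every step
        action = policy(frame, instruction)   # the model outputs an action, not text
        send_action(action)                   # hand the command to the robot
        history.append(action)
    return history


def stub_policy(frame: Image, instruction: str) -> Action:
    """Stand-in for a trained model: always commands zero motion."""
    return [0.0] * 7


sent: List[Action] = []
actions = run_episode(stub_policy,
                      get_frame=lambda: [[0]],
                      send_action=sent.append,
                      instruction="put the banana in the bowl",
                      max_steps=3)
```

The instruction stays fixed while the image changes every step, which is what lets the same policy react when the scene shifts mid-task.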

Why build VLAs on top of existing vision-language models?

Because a model already trained on internet images and text has broad common sense about objects and the world, and reusing that knowledge lets robots generalize to new objects and instructions instead of learning every task from scratch.

Cite this

APA

Ground Truth. (2026, July 2). What Are Vision-Language-Action Models? Ground Truth. https://groundtruth.day/learn/vision-language-action-models.html

BibTeX

@misc{groundtruth:vision-language-action-models,
  title  = {What Are Vision-Language-Action Models?},
  author = {{Ground Truth}},
  year   = {2026},
  month  = {jul},
  url    = {https://groundtruth.day/learn/vision-language-action-models.html}
}

Topics: robotics · multimodal · vision-language-action · embodied-ai · foundation-models