Embeddings: how AI turns words into directions in space
A computer cannot work with the word king as a string of letters. It can only do arithmetic. So the first job of almost every AI system that handles language, images, or audio is to turn each piece of input into a list of numbers, called a vector. That vector is an embedding, and the whole trick is that the numbers are not arbitrary labels. They are coordinates, placing the item at a specific point in a high-dimensional space so that where a thing sits encodes what it means.
Start with the simplest bad idea, because it shows what embeddings fix. You could give every word a unique ID: cat is 5, dog is 6, democracy is 7. But those numbers carry no meaning. The fact that 5 and 6 are neighbors tells you nothing about cats and dogs, and the model would have to memorize every word in isolation. Embeddings replace the single ID with a list of, say, a few hundred or a few thousand numbers per word, learned so that related things land near each other. After training, cat and dog sit close together, cat and democracy sit far apart, and the distance (or the angle) between two points becomes a usable measure of similarity.
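To make the contrast concrete, here is a minimal sketch in Python. The three-dimensional vectors are hand-picked for illustration and are not taken from any real model; real embeddings are learned and far longer. The helper name cosine_similarity is our own. Cosine similarity, the cosine of the angle between two vectors, is one common way to measure nearness.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
# Real embeddings are learned and have hundreds or thousands of dimensions.
embeddings = {
    "cat":       [0.9, 0.8, 0.1],
    "dog":       [0.8, 0.9, 0.2],
    "democracy": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
cat_democracy = cosine_similarity(embeddings["cat"], embeddings["democracy"])

print(f"cat vs dog:       {cat_dog:.3f}")
print(f"cat vs democracy: {cat_democracy:.3f}")
```

With these made-up numbers, cat scores much closer to dog than to democracy. The IDs 5, 6, and 7 could never tell you that.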
The famous demonstration that this captures real structure is vector arithmetic. In a well-trained word-embedding space, you can take the vector for king, subtract man, add woman, and land near queen. The direction that means roughly male-to-female is a consistent direction you can travel in; so is the one for country-to-capital, turning Paris minus France plus Italy into something near Rome. The model was never told these relationships. They fell out of a simple objective: predict a word from the words around it. This is the idea that word2vec put into practice and that GloVe reached by a different route, fitting vectors to corpus-wide co-occurrence counts. Both rest on an older observation from linguistics, J. R. Firth's line that you shall know a word by the company it keeps: if you push words that appear in similar contexts toward similar vectors, meaning organizes itself geometrically.
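The arithmetic can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the vectors are toy values with three hand-labeled dimensions (royalty, maleness, fruitiness), and analogy is a hypothetical helper, not a library function. Real embedding dimensions do not come with such tidy labels. The sketch also skips the three input words when searching for the nearest neighbor, which is the usual practice in these demonstrations, since the result lands near the answer rather than exactly on it.

```python
import math

# Hypothetical toy vectors: [royalty, maleness, fruitiness].
# Hand-picked for illustration; not from any trained model.
vectors = {
    "king":   [0.9,  0.9, 0.0],
    "queen":  [0.9, -0.9, 0.1],
    "man":    [0.1,  0.9, 0.0],
    "woman":  [0.1, -0.9, 0.0],
    "prince": [0.7,  0.8, 0.0],
    "apple":  [0.0,  0.0, 1.0],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, table):
    """Return the word nearest to a - b + c, skipping the three inputs."""
    target = [x - y + z for x, y, z in zip(table[a], table[b], table[c])]
    candidates = [w for w in table if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(target, table[w]))

result = analogy("king", "man", "woman", vectors)
print("king - man + woman is nearest to:", result)
```

Subtracting man removes most of the maleness component while leaving royalty intact, and adding woman moves the point to the other end of that axis, which is where queen sits in this toy space.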
How are these vectors actually made? They are learned, not hand-written. The model starts with random vectors and adjusts them through training so that it gets better at some task -- predicting a missing word, or telling real word-pairs from fake ones. Every time it is wrong, it nudges the vectors a little; over billions of examples, the geometry settles into something meaningful. Early systems learned one fixed vector per word, which has an obvious flaw: bank by a river and bank with your money got the same point. Modern systems built on transformers, like BERT, produce contextual embeddings -- the vector for a word is computed fresh each time, shaped by the sentence around it, so the two banks land in different places. The static word vectors of the 2010s grew up into the context-aware representations inside today's language models.
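The "nudge the vectors when wrong" loop can be sketched in plain Python. This is a stripped-down illustration in the spirit of skip-gram with negative sampling (telling real word pairs from fake ones), not the actual word2vec implementation: the corpus is six made-up sentences, fake pairs are drawn uniformly at random, and every name and setting here (DIM, WINDOW, LR, train_pair, and so on) is our own assumption.

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable

# Tiny made-up corpus; a real system would train on billions of words.
corpus = [
    "the cat chased the mouse",
    "the dog chased the cat",
    "the cat sat on the mat",
    "the dog sat on the rug",
    "voters shape a democracy",
    "a democracy needs voters",
]
sentences = [s.split() for s in corpus]
vocab = sorted({w for s in sentences for w in s})

# Illustrative settings, chosen for this toy example only.
DIM, WINDOW, NEGATIVES, LR, EPOCHS = 8, 2, 2, 0.05, 100

def random_vector():
    return [random.uniform(-0.5, 0.5) for _ in range(DIM)]

# Two tables: one vector per word as a center word, one as a context word.
word_vecs = {w: random_vector() for w in vocab}
context_vecs = {w: random_vector() for w in vocab}

# Real pairs: each word with its neighbors inside the window.
real_pairs = [
    (sent[i], sent[j])
    for sent in sentences
    for i in range(len(sent))
    for j in range(max(0, i - WINDOW), min(len(sent), i + WINDOW + 1))
    if i != j
]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_pair(word, context, label):
    """One nudge: move both vectors to shrink the prediction error."""
    w, c = word_vecs[word], context_vecs[context]
    p = sigmoid(sum(a * b for a, b in zip(w, c)))
    error = p - label  # gradient of the log loss with respect to the score
    for k in range(DIM):
        w_k = w[k]  # keep the old value so both updates use pre-step vectors
        w[k] -= LR * error * c[k]
        c[k] -= LR * error * w_k
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
    return -math.log(p if label == 1 else 1.0 - p)

epoch_losses = []
for epoch in range(EPOCHS):
    random.shuffle(real_pairs)
    total, count = 0.0, 0
    for word, context in real_pairs:
        total += train_pair(word, context, 1)  # real pair, label 1
        count += 1
        for _ in range(NEGATIVES):
            fake = random.choice([v for v in vocab if v != context])
            total += train_pair(word, fake, 0)  # fake pair, label 0
            count += 1
    epoch_losses.append(total / count)

print(f"average loss, first epoch: {epoch_losses[0]:.3f}")
print(f"average loss, last epoch:  {epoch_losses[-1]:.3f}")
```

The falling loss is the "geometry settling" described above, in miniature. A corpus this small is far too little data to expect meaningful neighbors, so the sketch only shows the mechanics of the update. Note that it learns one fixed vector per word, so it has exactly the bank problem that contextual models address.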
Embeddings are not just for words. The same idea, turning a thing into a point in space where nearness means similarity, works for whole sentences, documents, images, audio clips, even users and products. This is why embeddings quietly power so much of what you use. Semantic search finds documents by meaning rather than exact keywords, because the query and the right document land near each other even when they share no words. Recommendation systems place you and the things you might like in the same space. And retrieval-augmented generation, which gives a language model a private knowledge base to consult, typically leans on embeddings for its retrieval step: you embed your documents, embed the question, and grab the nearest chunks. Embeddings also sit just downstream of tokenization, which first chops text into the units that get embedded.
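The embed, embed, grab-the-nearest recipe looks roughly like this in Python. The vectors below are hypothetical stand-ins: in a real system both the chunks and the question would be passed through the same embedding model, and a vector database would usually replace the sort. The helper nearest_chunks is our own name.

```python
import math

# Hypothetical pre-computed embeddings for three document chunks.
# In practice these would come from an embedding model, not be typed by hand.
chunks = {
    "How to reset your password":     [0.9, 0.1, 0.0],
    "Refund policy for annual plans": [0.1, 0.9, 0.1],
    "Office holiday schedule":        [0.0, 0.1, 0.9],
}

# Hypothetical embedding of the question "I forgot my login credentials".
question_vec = [0.8, 0.2, 0.1]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_chunks(query_vec, index, k=2):
    """Return the k chunk texts whose vectors are most similar to the query."""
    ranked = sorted(
        index.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

top = nearest_chunks(question_vec, chunks, k=2)
for text in top:
    print(text)
```

The question and the top-ranked chunk share no words at all. They match only because their vectors point in nearly the same direction, which is the whole point of searching by meaning.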
The honest caveat is that an embedding is only ever as good as the data and objective that shaped it. The geometry inherits whatever patterns lived in the training text, biases included: the same arithmetic that turns king into queen has been shown to encode stereotyped associations too. And nearness in the space means only that two items appeared in statistically similar contexts in the training data, which is not the same as being true or correct. But as a foundational idea, the importance of embeddings is hard to overstate: they are the bridge from the messy human world of words and pictures into the clean numerical world where neural networks actually compute. Almost everything else in modern AI is built on top of that bridge.