Transformers: the engine inside almost every modern AI
Almost every AI system you have heard of (ChatGPT, Claude, Gemini, the model writing this sentence) runs on the same underlying design: the transformer. It was introduced in a 2017 paper with the now-famous title "Attention Is All You Need", and it is no exaggeration to say it reorganized the field. Understanding it is the closest thing to understanding the machine behind modern AI.
To see why it mattered, look at what came before. Older language models read text the way you might read with a finger under each word: strictly left to right, one word at a time, carrying a running summary in memory. This is how recurrent networks (RNNs) worked, and it had two problems. First, it was slow, because step N could not start until step N-1 finished, so you could not spread the work across a chip that thrives on doing thousands of things at once. Second, it was forgetful: by the time the model reached the end of a long paragraph, the beginning had faded into a blurry summary.
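Both problems can be seen in a toy sketch. This is not a real recurrent network: the running summary is a single number, and the decay value of 0.5 and the function names are assumptions chosen only to make the arithmetic easy to follow.

```python
# Toy stand-in for a recurrent model: NOT a real RNN, just a running
# summary that is updated one word at a time.
DECAY = 0.5  # assumed value: how much of the old summary survives each step


def run_sequentially(word_values, decay=DECAY):
    """Fold the inputs into one running summary, strictly in order."""
    summary = 0.0
    for value in word_values:
        # Step N needs the summary from step N-1, so steps cannot overlap.
        summary = decay * summary + value
    return summary


def first_word_share(num_words, decay=DECAY):
    """Weight the first word still carries after num_words steps."""
    return decay ** (num_words - 1)


print(run_sequentially([1.0, 1.0, 1.0]))  # 1.75
print(first_word_share(5))                # 0.0625
print(first_word_share(50))               # vanishingly small
```

The loop cannot be split across workers because each pass reads the result of the previous one, and the first word's share of the summary shrinks with every step. Real recurrent networks use learned updates rather than a fixed decay, but the ordering constraint is the same.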
The transformer threw out the finger-under-the-word approach. Its core idea, attention, lets every word look directly at every other word in the input, all at once, and decide which ones matter for understanding it.
Here is the intuition. Take the sentence "The trophy didn't fit in the suitcase because it was too big." What does "it" refer to, the trophy or the suitcase? To resolve that, the word "it" needs to pay attention to "trophy" and "big." Attention is the mechanism that lets it do exactly that: for each word, the model scores how relevant every other word is, then builds that word's meaning as a weighted blend of the words it found most relevant. Words that matter to each other get strong connections; irrelevant ones get ignored. (The technical machinery is called query-key-value, but the picture to hold is simpler: every word asks every other word "how relevant are you to me?" and mixes in the answers.)
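For readers who like to see the arithmetic, here is a minimal sketch of that score-then-blend step, in the form usually called scaled dot-product attention. The two-number vectors are made up by hand so that "it" matches "trophy"; a real model learns its query, key, and value vectors during training, and the function names here are illustrative.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        # "How relevant are you to me?" asked of every word.
        scores = [dot(q, k) / scale for k in keys]
        weights = softmax(scores)
        # The word's new meaning: a weighted blend of the value vectors.
        blended = [sum(w * v[i] for w, v in zip(weights, values))
                   for i in range(len(values[0]))]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights


# Hand-picked toy vectors for three words (assumptions, not learned values).
words = ["trophy", "suitcase", "it"]
keys = [[4.0, 0.0], [0.0, 4.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
queries = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]  # "it" is built to match "trophy"

outputs, weights = attend(queries, keys, values)
for word, w in zip(words, weights[2]):
    print(f"it -> {word}: {w:.2f}")
```

With these toy numbers, nearly all of the weight for "it" lands on "trophy", so the blended vector for "it" ends up looking mostly like the value vector for "trophy".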
Two refinements make it powerful. The model runs many attention operations in parallel, called multi-head attention, so different heads can specialize, one tracking grammar, another tracking who-did-what-to-whom. And because attention by itself sees a bag of words rather than an ordered sequence, the transformer adds positional information so the model still knows that "dog bites man" differs from "man bites dog."
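The bag-of-words point can be checked with a small experiment. The sketch below uses a single attention head, hand-picked word vectors, and hand-picked position vectors, all of which are assumptions for illustration; real models learn their embeddings and use more carefully designed position schemes.

```python
import math


def self_attend(vectors):
    """Single-head self-attention: each vector is its own query, key, and value."""
    scale = math.sqrt(len(vectors[0]))
    outputs = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in vectors]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        outputs.append([sum(e / total * v[i] for e, v in zip(exps, vectors))
                        for i in range(len(q))])
    return outputs


# Hypothetical 2-number embeddings, chosen by hand.
EMBED = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [1.0, 1.0]}
# Hypothetical position vectors, one per slot in the sentence.
POSITION = [[0.0, 0.0], [0.5, -0.5], [1.0, -1.0]]


def encode(sentence, use_positions):
    """Return each word's attention output, optionally adding position vectors."""
    vectors = [EMBED[w] for w in sentence]
    if use_positions:
        vectors = [[e + p for e, p in zip(vec, POSITION[i])]
                   for i, vec in enumerate(vectors)]
    return dict(zip(sentence, self_attend(vectors)))


def close(a, b):
    return all(abs(x - y) < 1e-9 for x, y in zip(a, b))


a, b = ["dog", "bites", "man"], ["man", "bites", "dog"]
same_without = close(encode(a, False)["dog"], encode(b, False)["dog"])
same_with = close(encode(a, True)["dog"], encode(b, True)["dog"])
print(same_without, same_with)  # True False
```

Without position vectors, "dog" comes out identical in both sentences, because attention only sees which words are present. Once each slot adds its own position vector, the two sentences produce different results for "dog".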
The payoff was enormous, and a lot of it came down to hardware. Because attention compares all words simultaneously instead of marching through them one by one, the whole computation can be done in parallel, which is exactly what GPUs are built for. That unlocked training on far more data and far bigger models than RNNs ever allowed, and it is a big part of why progress accelerated so sharply (the relationship between size and capability is its own topic, covered in our lesson on scaling laws).
This design is also why several other concepts on this site exist. The limit on how much text a model can consider at once, its context window, owes a lot to attention's cost: comparing every word to every other word means the work grows with the square of the input length, so doubling the text roughly quadruples the attention work. The trick of activating only part of a giant model for each word, mixture of experts, is a modification bolted onto the transformer to make it cheaper to run. And the heavy one-time cost of building one of these versus the cheap-per-use cost of running it is the distinction we draw in training vs inference.
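The square-law arithmetic is easy to check. This sketch counts only the word-to-word relevance scores for a single attention pass; it ignores everything else a real model does, and the function name is illustrative.

```python
def attention_comparisons(num_words):
    """Number of word-to-word relevance scores in one attention pass."""
    return num_words * num_words


for n in (1_000, 2_000, 4_000):
    print(f"{n:>5} words -> {attention_comparisons(n):>12,} comparisons")

ratio = attention_comparisons(2_000) / attention_comparisons(1_000)
print(ratio)  # 4.0
```

Going from 1,000 to 2,000 words takes the count from one million to four million, and doubling again to 4,000 words takes it to sixteen million.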
A few notes on names, because they trip people up. A "transformer" is the architecture. "GPT" stands for Generative Pre-trained Transformer, and the T is this. BERT, a model Google has used in Search since 2019, is also a transformer, just pointed at a different job: it reads in both directions to understand text rather than generating it left to right. Same skeleton, different uses.
Attention itself was not brand new in 2017. An earlier line of machine-translation work had introduced attention in 2014 as an add-on to RNNs. The 2017 paper's radical move was right there in the title: throw away the recurrence entirely and keep only attention. That bet defined the decade of AI that followed.
If you remember one thing: the transformer's superpower is that it lets a model weigh every piece of its input against every other piece, in parallel. That single idea is what put the "large" in large language models.